Abstract

We consider random fields defined by finite-region conditional probabilities depending
on a neighborhood of the region which changes with the boundary conditions. To predict
the symbols within any finite region it is necessary to inspect a random number of
neighborhood symbols, which may change according to their values. In analogy
to the one-dimensional setting we call these neighborhood symbols the context associated
to the region at hand. This framework is a natural extension, to d-dimensional fields, of
the notion of variable-length Markov chains introduced by Rissanen (1983) in his classical
paper. We define an algorithm to estimate the radius of the smallest ball containing the
context based on a realization of the field. We prove the consistency of this estimator.
Our proofs are constructive and yield explicit upper bounds for the probability of wrong
estimation of the radius of the context.

Key words: Gibbs measures, random lattice fields, Variable-neighborhood random
1 Introduction

We consider random fields on $\mathbb{Z}^d$ with finite state space defined by prescribing a family
of conditional probabilities indexed by finite subsets $\Lambda$ of $\mathbb{Z}^d$. We assume that these
conditional probabilities depend on a finite neighborhood which changes according to the
boundary conditions. In contrast to standard Markov random fields, which are defined by a
family of conditional probabilities depending on a fixed neighborhood and not sensitive
to the boundary conditions (fixed order Markov dependence), the families of conditional
probabilities considered here are not restricted to a predefined uniform depth. Rather,
by examining the training data, a model is constructed that fits higher order Markov
dependencies where needed, while using lower order Markov dependence elsewhere. We call
these random fields Variable-neighborhood random fields or Parsimonious Markov random
fields.

Adopting this parsimonious description means that we are aiming at reducing infor- 
mation by finding the minimal neighborhood of a given block of sites able to predict 
the states of the sites within this block. The neighborhood changes when the outside 
configuration of the field changes, and the dependencies depend on the realization of the 
field. 
Applications of Variable-neighborhood random fields arise in image analysis: in texture
synthesis, computer vision and graphics. We refer the interested reader to Efros and
Leung (1999), [12], for the presentation of a non-parametric texture synthesis method.
Texture is usually modelled as a Markov random field, "composed of well defined texture
primitives (texels) which are placed according to some syntactic rules" (Gidofalvi (2001),
[22]). Thus modelling texture as a Variable-neighborhood random field, where the width of
the context window may depend on the realization of the field, is natural. Other possible
applications of Variable-neighborhood random fields are in neuroscience and, more generally,
in spatial statistics, whenever information reduction is needed.

The notion of Variable-neighborhood random fields has been inspired by Rissanen's
Minimum Description Length principle for Markov chains, see Rissanen (1983), [27].
Rissanen calls the relevant neighborhood of a site, i.e. the sequence of symbols needed to
predict the next symbol given a finite sample, the context of the site, and proposes an
estimator of the length of the context. Since this seminal paper, there have been several
implementations and extensions of the method. We refer to the book of Grunwald (2007),
[23], and to a review paper by Galves and Locherbach (2008), [17], for a comprehensive
introduction. Results for the context algorithm can be found for example in Ferrari and
Wyner (2003) and in Galves, Maume-Deschamps and Schmitt (2008), [18]. In this last
paper, the rate of convergence of the context algorithm is established and non-asymptotic
error bounds implying the consistency of the estimator are obtained. All these results
concern processes in dimension one. Our aim is to extend this method to more than one
dimension and to define an estimator of the context in the framework of random fields.

This requires defining a random field in which the symbol at a given site can be predicted by
inspecting a "random" number of neighborhood symbols, which may change according
to their values. In analogy to the one-dimensional setting we call this neighborhood,
i.e. the subset of symbols needed to predict the symbol at the given site, the context of
this site. For such random fields we estimate the radius of the context of a given site,
i.e. the radius of the smallest ball containing the context of this site. It is enough to
consider the contexts for one site, since in our setting the one-point specification uniquely
determines the specification for any other set. We apply a penalized pseudo-likelihood
method, first introduced by Besag (1975), [2], and developed by Csiszar and Talata (2006),
[4], in order to construct our estimator. Our estimator is a function of the observed blocks
or patterns appearing in the sample. It is based on a sequence of local decisions between
two possible values of the radius of the context, lumping them together whenever their
corresponding one-point conditional probabilities are similar. We propose an estimator
for any site within our observation window, depending on its local neighborhood. Hence
we deal with a family of estimators indexed by the centers of observation patterns. For
this family of estimators, we give in Theorem 3.5 and Theorem 3.9 explicit error bounds
for the probability of over- and underestimation. These bounds are non-asymptotic with
respect to the number of observed sites, i.e. the size of the observation window. As a
consequence, we obtain the consistency of the neighborhood radius estimator.

Our results are based on several deviation inequalities which are interesting in their own
right. They are collected in Sections 4 and 5. The first part of them (Section 4) is based on
results obtained by Dedecker (2001), [6], on deviation inequalities for random fields; the
second part (Section 5) is a rewriting of typicality results obtained by Csiszar and Talata
(2006), [4]. Csiszar and Talata are only interested in consistency and they do not give
explicit upper bounds for the error probabilities. We want to control the error bounds
for any fixed n, and so we carry their ideas over into non-asymptotic deviation inequalities.

We implement the estimates under the requirement that the one point conditional 
probabilities are strictly positive. This is enough for the overestimation. To implement 
the estimates for the underestimation, we need to assume that Dobrushin's contraction
condition holds, see Dobrushin (1968), [9] and [10], and that there exists some finite order 
L, unknown to the statistician, such that the random field is Markov of order at most L. 
In the language of context-trees this means that we deal with finite trees only. 

There is a large number of papers devoted to parameter estimation for Markov random
fields when the structure of the interaction is known, see for example Gidas (1993), [21],
Comets (1992), [3], Dereudre and Lavancier (2011), [8], and many others. Typically,
parameter estimation addresses the problem of estimating the parameters that
determine the potential, but not directly the conditional probabilities. Quite recently,
the non-parametric problem of model selection has been addressed, i.e. the statistical
estimation of the interaction structure, see for example Ji and Seymour (1996), [24].
Csiszar and Talata (2006), [4], propose to estimate the basic neighborhood of Markov
random fields and estimate the support of the neighborhood (i.e. its geometrical structure)
which is relevant to determine the conditional probabilities. In their framework this
neighborhood does not depend on the configuration, hence they work in a strict Markovian
frame. In Galves et al. (2010), [19], a related problem has been studied. The authors
estimate, for an Ising model having pairwise interactions of infinite range, the pairs of
interacting sites based on i.i.d. observations of the field. Our paper is not situated in the
same framework. We do not address the problem of estimating the geometrical structure of
the contexts, since this would require introducing too many free parameters. We deal with
a problem which is simpler and more difficult at the same time: we estimate only the radius
of the basic neighborhood, but this neighborhood varies when the configuration changes.
This last feature is the main difference from previous models which have appeared in the
literature.

The paper is organized as follows. In Section 2 we define the Variable-neighborhood
random fields, based on the prescription of a "variable-neighborhood" specification, and we
provide examples. In Section 3 we define the estimator of the radius of a single-site context
and formulate the main results. In Theorem 3.5 we give the bound on the probability
of overestimation and in Theorem 3.9 the bound on the probability of underestimation,
under suitable assumptions on the decay of correlations in the field. In Section 4 we
prove the deviation inequalities needed for controlling the underestimation and in Section
5 those needed for controlling the overestimation. In Sections 6 and 7 we give the proofs
of the main results. We conclude with some final remarks in Section 8. In Section 9, the
appendix, we collect some mathematical tools needed along the way. In particular we
prove a relation between single-site contexts and contexts of finite sets of sites.
2 Variable-neighborhood random fields

We consider the $d$-dimensional lattice $\mathbb{Z}^d$. The points $i$ are called sites; $\|i\|$ denotes
the maximum norm of $i$, i.e. for $i = (i_1, \dots, i_d)$, $\|i\| = \max(|i_1|, \dots, |i_d|)$ is the maximum
of the absolute values of the coordinates of $i$. The cardinality of a finite set $\Lambda$ is denoted
by $|\Lambda|$. The notations $\subset$ and $\subsetneq$ denote inclusion and strict inclusion. Subsets of $\mathbb{Z}^d$ will
be denoted with uppercase Greek letters. If $\Lambda$ is a finite set, we write $\Lambda \Subset \mathbb{Z}^d$.

A random field $X$ is a family of random variables indexed by the sites $i$ of the lattice,
$\{X_i : i \in \mathbb{Z}^d\}$, where each $X_i$ is a random variable taking values in a finite set $\mathcal{A}$.

We denote the set of all possible configurations of the random field by

$$\Omega = \mathcal{A}^{\mathbb{Z}^d},$$

where $\Omega$ is endowed with the product topology. We adopt the following notational
conventions. We write $\omega_\Lambda \in \mathcal{A}^\Lambda$ for the restriction of the configuration $\omega$ to the subset $\Lambda$.
If $\Lambda = \{i\}$ is a singleton, we shall write $\omega(i)$ for $\omega_{\{i\}}$. Configurations defined by regions
are factorized, with omitted subscripts indicating completion to the rest of the lattice:
$\omega_\Lambda \eta_{\Lambda^c} = \omega_\Lambda \eta$. We call local configurations the elements of $\bigcup_{\Lambda \Subset \mathbb{Z}^d} \mathcal{A}^\Lambda$.

We identify the random field $\{X_i : i \in \mathbb{Z}^d\}$ with the coordinate maps $X_i$ by $X_i(\omega) =
\omega(i)$, for any $\omega \in \Omega$, and from now on we will use this canonical version of the random
field. We define the following $\sigma$-algebras: for any $\Gamma \subset \mathbb{Z}^d$, let

$$\mathcal{F}_\Gamma = \sigma\{X_i : i \in \Gamma\} \quad \text{and} \quad \mathcal{F} = \sigma\{X_i : i \in \mathbb{Z}^d\}.$$

In this setup a random field is just a probability measure on the product space $(\Omega, \mathcal{F})$.
This measure is defined by local specifications. To define them, we recall the following
well-known notions in statistical mechanics, see Georgii (1988), [20].

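Definition 2.1 A probability kernel on $(\Omega, \mathcal{F})$ is a function $\Gamma(\cdot \mid \cdot) : \mathcal{F} \times \Omega \to [0,1]$
such that

(a) $\Gamma(\cdot \mid \omega)$ is a probability measure on $(\Omega, \mathcal{F})$, for each $\omega \in \Omega$,

(b) $\Gamma(A \mid \cdot)$ is an $\mathcal{F}$-measurable function for each $A \in \mathcal{F}$.

Definition 2.2 A specification on $(\Omega, \mathcal{F})$ is a family $\gamma = \{\gamma_\Lambda\}_{\Lambda \Subset \mathbb{Z}^d}$ of probability kernels
on $(\Omega, \mathcal{F})$ such that

(a) For each $\Lambda \Subset \mathbb{Z}^d$ and each $A \in \mathcal{F}$, the function $\gamma_\Lambda(A \mid \cdot)$ is $\mathcal{F}_{\Lambda^c}$-measurable,

(b) For each $\Lambda \Subset \mathbb{Z}^d$ and each $A \in \mathcal{F}_{\Lambda^c}$, $\gamma_\Lambda(A \mid \omega) = 1_A(\omega)$,

(c) For any pair of regions $\Lambda$ and $\Delta$, with $\Lambda \subset \Delta \Subset \mathbb{Z}^d$, and any measurable set $A$,

$$\int \gamma_\Delta(d\omega' \mid \omega)\, \gamma_\Lambda(A \mid \omega') = \gamma_\Delta(A \mid \omega) \tag{2.1}$$

for all $\omega \in \Omega$.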
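Definition 2.3 A probability measure $\mu$ on $(\Omega, \mathcal{F})$ is consistent with a specification $\gamma$ if
for each $\Lambda \Subset \mathbb{Z}^d$ and for each $A \in \mathcal{F}$,

$$\int \mu(d\omega)\, \gamma_\Lambda(A \mid \omega) = \mu(A). \tag{2.2}$$

We now define the Variable-neighborhood random fields.

Definition 2.4 Variable-neighborhood random field

Let $\mu$ be a probability measure on $(\Omega, \mathcal{F})$ consistent with the specification $\gamma$. Then $\mu$ is
a variable-neighborhood random field if for any $\Lambda \Subset \mathbb{Z}^d$ and for $\mu$-almost all $\omega_{\Lambda^c}$ the
following holds: there exists $\Gamma = \Gamma(\omega) \Subset \mathbb{Z}^d$ such that

$$\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c}) = \gamma_\Lambda(\cdot \mid \omega_\Gamma),$$

and for all $\tilde\Gamma \subset \mathbb{Z}^d$, if $\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c}) = \gamma_\Lambda(\cdot \mid \omega_{\tilde\Gamma})$, then $\Gamma \subset \tilde\Gamma$. We denote

$$\mathrm{sp}_\Lambda(\omega) = \Gamma(\omega) \quad \text{and} \quad c_\Lambda(\omega) = \omega_{\mathrm{sp}_\Lambda(\omega)}, \tag{2.3}$$

the restriction of $\omega$ to the set $\mathrm{sp}_\Lambda(\omega)$.

Remark 2.5 In Definition 2.4 the requirement that $\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c}) = \gamma_\Lambda(\cdot \mid \omega_{\tilde\Gamma})$ implies
$\Gamma(\omega) \subset \tilde\Gamma$ allows one to identify in an unambiguous way the random set $\Gamma(\omega)$ on which
$\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c})$ depends.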

Remark 2.6 We also call the Variable-neighborhood random fields Parsimonious Markov
random fields. Namely, $\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c})$ depends only on $\omega_{\mathrm{sp}_\Lambda(\omega)}$ and we do not need to inspect
the whole configuration $\omega_{\Lambda^c}$ in order to decide about the configuration of symbols within
$\Lambda$. Indeed it is sufficient to inspect $\omega_{\mathrm{sp}_\Lambda(\omega)}$.

According to Definition 2.4 there might be a set of realizations of $\mu$-measure zero such
that $|\mathrm{sp}_\Lambda(\omega)| = \infty$. From now on we assume that for all $\omega \in \Omega$, $\mathrm{sp}_\Lambda(\omega)$ is a finite set.
This means that for all $\omega \in \Omega$, $\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c})$ depends only on a finite, but random,
neighborhood of $\Lambda$. When for some $\Gamma_0 \Subset \mathbb{Z}^d$, $\mathrm{sp}_\Lambda(\omega) = \Gamma_0$ for all $\omega$, then $\mu$ (respectively,
$X$) is a Markov field with basic neighborhood $\Gamma_0$. Define the $\sigma$-algebra

$$\mathcal{F}_{\mathrm{sp}_\Lambda} = \{A \in \mathcal{F} : \forall\, \Gamma \subset \mathbb{Z}^d : \{\mathrm{sp}_\Lambda = \Gamma\} \cap A \in \mathcal{F}_\Gamma\}. \tag{2.4}$$

Then for all $\omega_\Lambda \in \mathcal{A}^\Lambda$, $\gamma_\Lambda(\{\omega_\Lambda\} \mid \cdot)$ is a measurable function with respect to $\mathcal{F}_{\mathrm{sp}_\Lambda}$.

In analogy to the terminology used for one-dimensional variable length Markov chains
we can rephrase Definition 2.4 using the concept of a family of contexts. This generalizes
the notion of context trees to more than one dimension.

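Definition 2.7 The family of contexts associated to the specification $\gamma$

For $\Lambda \Subset \mathbb{Z}^d$ and $\omega \in \Omega$ we denote by $c_\Lambda^\gamma(\omega) = c_\Lambda(\omega)$, see (2.3), the $\Lambda$-context of $\omega$
associated to the specification $\gamma$. We write $\tau^{(\Lambda)} = \{c_\Lambda(\omega), \omega \in \Omega\}$ for the family
of $\Lambda$-contexts. Under our assumptions,

$$\tau^{(\Lambda)} \subset \bigcup_{\Gamma \Subset \mathbb{Z}^d \setminus \Lambda} \mathcal{A}^\Gamma. \tag{2.5}$$

We use the short-hand notation $c_i(\omega)$ for $c_{\{i\}}(\omega)$, $\mathrm{sp}_i(\omega)$ for $\mathrm{sp}_{\{i\}}(\omega)$ and $\gamma_i(a \mid \omega)$ for
$\gamma_{\{i\}}(\{a\} \mid \omega)$, $i \in \mathbb{Z}^d$.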
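Remark 2.8 It is immediate to verify from Definition 2.4 that the family $\tau^{(\Lambda)}$ has the
following properties:

• No element of $\tau^{(\Lambda)}$ is a restriction of any other element of $\tau^{(\Lambda)}$: if $\eta_\Gamma$ and $\tilde\eta_{\tilde\Gamma}$ both
belong to $\tau^{(\Lambda)}$, $\Gamma \subset \tilde\Gamma$ and $\tilde\eta_\Gamma = \eta_\Gamma$, then $\Gamma = \tilde\Gamma$.

• $\tau^{(\Lambda)}$ defines a partition of $\mathcal{A}^{\mathbb{Z}^d \setminus \Lambda}$, that is, for each $\omega \in \mathcal{A}^{\mathbb{Z}^d \setminus \Lambda}$ there exists a unique
$\Gamma \subset \mathbb{Z}^d \setminus \Lambda$ such that $\omega_\Gamma \in \tau^{(\Lambda)}$.

In this way the family of local specifications associated to $\mu$ is

$$\gamma = \{\gamma_\Lambda(\cdot \mid c_\Lambda(\omega)),\ \Lambda \Subset \mathbb{Z}^d;\ c_\Lambda(\omega)\} \tag{2.6}$$

which leads to a more parsimonious description than the original

$$\{\gamma_\Lambda(\cdot \mid \omega_{\Lambda^c}),\ \Lambda \Subset \mathbb{Z}^d;\ \omega_{\Lambda^c}\}. \tag{2.7}$$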
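The first property of a family of contexts (no context is a restriction of another) can be checked mechanically on a finite candidate family. The following is only a minimal sketch: representing a context as a Python dictionary from sites to symbols, and the names `is_proper_restriction`, `restriction_violations` and `family`, are illustrative assumptions and not notation from the paper.

```python
from itertools import combinations


def is_proper_restriction(small, large):
    """True if `small` lives on strictly fewer sites than `large` and agrees with it there."""
    return len(small) < len(large) and all(
        site in large and large[site] == symbol for site, symbol in small.items()
    )


def restriction_violations(contexts):
    """Return all pairs of contexts in which one is a proper restriction of the other."""
    return [
        (a, b)
        for a, b in combinations(contexts, 2)
        if is_proper_restriction(a, b) or is_proper_restriction(b, a)
    ]


# Hypothetical family of contexts of site 0 on the line (sites are integers).
family = [{-1: 1}, {-1: 0, -2: 1}, {-1: 0, -2: 0}]
print(restriction_violations(family))
# Adding {-1: 0} breaks the property: it is a restriction of the two longer contexts.
print(restriction_violations(family + [{-1: 0}]))
```

The check is quadratic in the number of contexts, which is fine for small illustrative families; it does not verify the partition property.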

We close the section by presenting three examples. In the first one we embed a renewal
process in a one-dimensional Variable-neighborhood random field. This example has been
suggested by Ferrari and Wyner (2003), [14]. In the second example we construct a
two-dimensional Variable-neighborhood random field by specifying a variable-neighborhood
interaction potential. In the third example, we define a two-dimensional Voronoi cell
interaction model. This example has been inspired by Dereudre, Drouilhet and Georgii
(2010), [7].

Example 2.9 We consider $\mathcal{A} = \{0,1\}$. Let $\{X_n : n \in \mathbb{Z}\}$ be a stationary alternating
renewal process taking values in $\mathcal{A}$, i.e. the times when the process switches between 1 and
0 or 0 and 1 are independent and identically distributed random variables. They have the
same distribution as the random variable $T$ defined through

$$\mathbb{P}[T = j] = c_1 \varrho_1^j + c_2 \varrho_2^j, \quad 0 < \varrho_2 < \varrho_1 < 1, \quad j \in \mathbb{N}.$$

Set $\mu = \mathbb{E}[T]$. This process can be realized as a one-dimensional Variable-neighborhood
random field. To this aim define for $a$ and $b$ in $\mathbb{Z}$

$$R_b(\omega) = \inf\{n > b + 1 : \omega(n) \ne \omega(b+1)\}, \quad L_a(\omega) = \sup\{n < a - 1 : \omega(n) \ne \omega(a-1)\}, \tag{2.8}$$

where $\omega \in \mathcal{A}^{\mathbb{Z}}$. The family of local specifications $\gamma_{[a,b]}$ indexed by $[a,b] \subset \mathbb{Z}$ is given as
follows:

$$\gamma_{[a,b]}(\cdot \mid c_{[a,b]}(\omega)) = \gamma_{[a,b]}(\cdot \mid L_a(\omega) = -k, R_b(\omega) = l).$$

The context $c_{[a,b]}(\omega)$ depends only on the neighbor sites of $[a,b]$ which are all of the same
type 0 or 1. By standard calculus we obtain the following formulas for the one-point
specification $\gamma_{\{0\}}(\cdot \mid c_0(\omega))$. Write $\mathrm{sp}_0(\omega) = [L_0(\omega), R_0(\omega)] \setminus \{0\}$. Then



7{0}(1 I Lo{uj) 



-k,Ro{uj) = l,uj{l) = uj{-l) 

j+fc-i 



ciei+c2g2 



C2 



Q2 



(2.9) 
and



7|0}(1 I Lo{co) = -k,Ro{u) = l,uj{l) = 0,uj{-l) = 1) 

{ciQi + C2Q2) (ci^'f ^ + 0292^^) 

= _ . ^ . ^—^ . (2.10) 

{ciQ^ + C2Q^) (^Ciq[ ^ + C2£'2 + (ci^t ^ + C2Q2 (cie^'i + C2Q2) 

Due to the symmetry between 0 and 1, it is clear that with formulas (2.9) and (2.10) we
have completely described the one-point specification.

In this example the context $c_0(\cdot)$ is $\mathbb{P}$-almost surely finite, i.e. there exists a subset of
configurations of $\mathbb{P}$-measure zero for which $|c_0(\omega)| = \infty$.
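To make the switching times in (2.8) concrete, the sketch below evaluates them on a deterministic toy configuration. It is an illustration under assumptions: the configuration is passed as a Python callable, the search is cut off after `max_steps` sites (returning `None` if no switch is found), and the names `right_switch`, `left_switch` and `omega` are hypothetical.

```python
def right_switch(omega, b, max_steps=10_000):
    """First site to the right of b + 1 whose symbol differs from omega(b + 1)."""
    reference = omega(b + 1)
    for n in range(b + 2, b + 2 + max_steps):
        if omega(n) != reference:
            return n
    return None  # no switch found within the search window


def left_switch(omega, a, max_steps=10_000):
    """First site to the left of a - 1 whose symbol differs from omega(a - 1)."""
    reference = omega(a - 1)
    for n in range(a - 2, a - 2 - max_steps, -1):
        if omega(n) != reference:
            return n
    return None  # no switch found within the search window


def omega(n):
    """Toy configuration: blocks of three equal symbols, alternating between 0 and 1."""
    return (n // 3) % 2


r0 = right_switch(omega, 0)
l0 = left_switch(omega, 0)
support = [n for n in range(l0, r0 + 1) if n != 0]
print(l0, r0, support)
```

For this toy configuration the support of the context of site 0 is the integer interval between the two switching sites, with the site 0 itself removed, as in the formula for $\mathrm{sp}_0(\omega)$.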

We now give an example of a Variable-neighborhood random field in dimension 
d = 2. In analogy with the procedure used in statistical mechanics we define a variable- 
neighborhood specification by introducing a variable-neighborhood interaction potential. 



Example 2.10 We consider $\mathcal{A} = \{-1,1\}$ and $d = 2$. In order to define the support of
the variable-neighborhood interaction potential it is convenient to embed $\mathbb{Z}^2$ into $\mathbb{R}^2$. We
partition $\mathbb{R}^2$ into cubes of edge 1 centered at $\mathbb{Z}^2$. We say that two cubes are connected if
they have one face in common. We denote by $\mathcal{R}$ the set of all connected unions of such
cubes, by $R$ an element of $\mathcal{R}$ and by $\partial R$ the topological surface of $R$.
We say that T C Z,'^ is a polygon if there exists R ^ TZ so that T = Rr\l?' . We denote 
by dV = {i e T : d{i,dR) < where d{i,dR) = inf{||z — j\\ : j G dR} and \\ ■ \\ is the 
maximum norm introduced at the beginning of this section. Finally, let t be the interior 

ofr, f = r\ar. 

We say that F is a simple polygon if dV is a path in 1? which does not cross itself and 
r 7^ 0. Note that dT can be the union of disjoint connected paths. 
Given oj £ A^'^ we define for each i ^1? 

T\{uj) = n{r C I?,T simple polygon ,i G T,u!Qr = 1 }• 

In the above definition we do not require T to be finite. F could be equal to 1? in which 
case dV = 0. In order to get a bounded interaction range, we set 

r,{oj) = Vi{L) n Tj{uj) and cf (u) = {w, : j e Ti{uj), j^i}, 

where Vi{L) = {j £ 1? : \\i — j\\ < L}. In the above definition, the superscript K un- 
derlines the fact that the context cf{uj) for the interaction might not be the same as the 
context Ci{uj) associated to the specification. Let {Jn,n- G N} be a collection of real num- 
bers and |rj(a;)| the cardinality ofTi{uj). We define the variable neighborhood interaction 
{K'^{uj),i G Z^} as following: 

K\u)=K\cf{u)) = J\r^^^^)\ n ^(^■)- 

By construction, the interaction is summable: 

sup sup \K^{uj)\ < oo . (2.11) 

Denote 

Hk{ujk,ojk-) = - ^ K\uj). 

{iez2;AnriH^0} 
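The Variable-neighborhood random field $\mu$ is determined by the following family of local
specifications

$$\gamma_\Lambda(\{\omega_\Lambda\} \mid \omega_{\Lambda^c}) = \frac{1}{Z_\Lambda^{\omega_{\Lambda^c}}} \exp\{-\beta H_\Lambda(\omega_\Lambda, \omega_{\Lambda^c})\}, \quad \omega_\Lambda \in \mathcal{A}^\Lambda,\ \omega_{\Lambda^c} \in \mathcal{A}^{\Lambda^c}, \tag{2.12}$$

where

$$Z_\Lambda^{\omega_{\Lambda^c}} = \sum_{\omega_\Lambda \in \{-1,1\}^\Lambda} \exp\{-\beta H_\Lambda(\omega_\Lambda, \omega_{\Lambda^c})\}.$$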

The family of contexts ca(<^) = ^spa{uj) o-ssociated to {7a}a; defined in (|2.12p . is deter- 
mined by Ci{uj) for i £ A, therefore by the knowledge of to only on spi(w). By Definition 
\2.4\ and by 112.12]) we have that 



spi(cj) 



U {rj(a;) : i G rj(a;)} U U {rj(a;^) : i E T^ioj')} 



\{i}, (2.13) 



where = for all j ^ i, uj^{i) = —uj{i). This formula gives the relation between 

the support of the context of the specification and the support of the interaction. We show 
in the appendix that the following identity holds: 

spiiu) = {Tjiuj) n Vi(2L)) \ {i}, Ci(a;) = a;,p^(,). (2.14) 

Note that the context associated to the family of the constructed specification Ci{uj) is 
different from cf (w), the context associated to the interaction. 

We now give a second example of a Variable-neighborhood random field in dimension
$d = 2$ in which contexts are no longer bounded. This example is inspired by a recent
paper of Dereudre, Drouilhet and Georgii (2010), [7], in which Gibbs point processes on
$\mathbb{R}^d$ with geometry dependent interactions are considered. Such processes can be realized
as Variable-neighborhood random fields. Compared to their work, our setup is simpler
since we do not work on $\mathbb{R}^2$ but on the grid $\mathbb{Z}^2$.



Example 2.11 We consider A = {0, 1} and d = 2. For u G A^ we denote by C{uj) = 
{i £ I? : oj{i) = 1}. We embed 1? into M^. For any fixed uj e A^ with C{uj) / and 
i £ C{uj), let 

Vori{uj) := {y G : ||i - y\\2 < \\k - y\\2, VA; G C{uj), k i} 

be the Voronoi cell associated to a point i G C{u)). If there exists no such k G C{uj), then 
set Vori(oj) = Tl?\ {i}. We denote by || • ||2 the I?— norm on M^. For a bounded function 
<I> : 'P(M^) —7- R defined on all subsets o/M^, we define for any i G C{ijj), 

K\uj) = ^{Vor,{uj)). 

For i i C(w), we set K^{u}) = 0. When C{uj) = set K'{uj) = for all i e I? . Finally, 
let 

Clearly, the range of interaction is not bounded in this case. We have to ensure that the 
interaction is summable. Since for any fixed i £ 7? and A; G N 

card{j el? ■.iel?r\ Vorj{uj), \\i - j|| = A:} < 8A;, 
where $\|\cdot\|$ is the $L^\infty$-norm, see the beginning of Section 2, it suffices to impose that

c 

diam(A)'^ 



for any ^ C M^, ^{A) < -j. — tW+f, for some constant C and e > 0. Then, 



sup sup ^ \K\uj)\ < oo . (2-15) 

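The Variable-neighborhood random field is then determined by the following family of
local specifications

$$\gamma_\Lambda(\{\omega_\Lambda\} \mid \omega_{\Lambda^c}) = \frac{1}{Z_\Lambda^{\omega_{\Lambda^c}}} \exp\{-\beta H_\Lambda(\omega_\Lambda, \omega_{\Lambda^c})\}, \quad \omega_\Lambda \in \mathcal{A}^\Lambda,\ \omega_{\Lambda^c} \in \mathcal{A}^{\Lambda^c}, \tag{2.16}$$

where

$$Z_\Lambda^{\omega_{\Lambda^c}} = \sum_{\omega_\Lambda \in \{0,1\}^\Lambda} \exp\{-\beta H_\Lambda(\omega_\Lambda, \omega_{\Lambda^c})\}.$$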

One can generalize the last two examples to $d \ge 3$ in a relatively straightforward way.

We are interested in estimating the support of the context $\mathrm{sp}_\Lambda(\omega)$ for a given set of
sites $\Lambda \Subset \mathbb{Z}^d$ and a given observation $\omega$. Proposition 9.1 of the Appendix shows that
$\gamma_\Lambda(\cdot \mid c_\Lambda(\omega))$ can be derived from the one-point specification $\gamma_i(\cdot \mid c_i(\omega))$ and that for
$\Lambda \Subset \mathbb{Z}^d$, we have

$$\mathrm{sp}_\Lambda(\omega) = \Big(\bigcup_{i \in \Lambda} \mathrm{sp}_{\{i\}}(\omega)\Big) \setminus \Lambda. \tag{2.17}$$

Hence, in order to estimate $\mathrm{sp}_\Lambda(\omega)$, it is sufficient to estimate the context for single sites,
i.e. $\mathrm{sp}_i(\omega)$. To implement the estimation procedure we need translation covariant models.
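Relation (2.17) is a plain set identity once the single-site supports are known. As a small illustration, assuming the single-site supports are stored in a dictionary keyed by site (the name `region_support` and the toy data are hypothetical):

```python
def region_support(single_site_supports, region):
    """Support of the context of a finite region, following the set identity (2.17)."""
    region = set(region)
    union = set()
    for site in region:
        union |= set(single_site_supports[site])
    # Sites of the region itself are never part of its context.
    return union - region


# Toy one-dimensional data: supports of the contexts of sites 0 and 1.
supports = {0: {-1, 1}, 1: {0, 2, 3}}
print(region_support(supports, {0, 1}))
```

Here site 1 belongs to the support of site 0 and vice versa, and both are removed when the region $\{0, 1\}$ is considered as a whole.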

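For any fixed $i \in \mathbb{Z}^d$, we denote by $\tau_i : \mathbb{Z}^d \to \mathbb{Z}^d$ the $i$-shift defined by $\tau_i(j) = i + j$. This
naturally induces on $\Omega$ the $i$-shift $\tau_i : \Omega \to \Omega$ defined by

$$(\tau_i \omega)(j) = \omega(\tau_i(j)) = \omega(i + j) \quad \forall j \in \mathbb{Z}^d.$$

Definition 2.12 A Variable-neighborhood random field $\mu$ on $(\Omega, \mathcal{F})$, determined by a
family of local specifications $\{\gamma_\Lambda\}_\Lambda$, is translation covariant if for all $\Lambda \Subset \mathbb{Z}^d$ and for all
$\omega \in \Omega$,

$$\gamma_\Lambda(\cdot \mid \omega) = \gamma_{\tau_i \Lambda}(\cdot \mid \tau_i \omega), \quad i \in \mathbb{Z}^d,$$

where $\tau_i \Lambda = \Lambda + i$.

In the following we will consider only translation covariant Variable-neighborhood random
fields. This implies that $\gamma_i(\cdot \mid c_i(\tau_i(\omega))) = \gamma_0(\cdot \mid c_0(\omega))$.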



3 Main Results and Estimation procedure 

In Section 2 we introduced the notion of Variable-neighborhood random fields. Such a
random field is completely determined by the one-point specification. It would therefore
be interesting to estimate $\mathrm{sp}_i(\omega)$, i.e. the set of points in $\mathbb{Z}^d$ which determines the
value of the symbol at the site $i$. This requires, however, estimating too many unknown
parameters. Therefore we are less ambitious and estimate the radius of the smallest ball
containing $\mathrm{sp}_i(\omega)$. For $\ell \ge 1$ and $i \in \mathbb{Z}^d$, define

$$V_i(\ell) = \{j \in \mathbb{Z}^d : \|i - j\| \le \ell\} \quad \text{and} \quad V_i^0(\ell) = V_i(\ell) \setminus \{i\}. \tag{3.1}$$

We also write

$$\partial V_i(\ell) = \{j \in \mathbb{Z}^d : \|i - j\| = \ell\}.$$

Then we define the length of the context of site $i$ by

$$l_i(\omega) = \inf\{\ell \ge 0 : \mathrm{sp}_i(\omega) \subset V_i(\ell)\}. \tag{3.2}$$

Note that $l_i(\omega)$ is a stopping time with respect to the filtration $(\mathcal{F}_n^i)_n = (\mathcal{F}_{V_i(n)})_n$.
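The objects in (3.1) and (3.2) are straightforward to compute for a finite support. The sketch below is illustrative only: sites are tuples of integers, the empty support is given radius 0 (an assumed convention), and the function names are hypothetical.

```python
from itertools import product


def max_norm(v):
    """Maximum norm of an integer vector."""
    return max(abs(x) for x in v)


def punctured_ball(site, ell):
    """All sites at max-norm distance between 1 and ell from `site`."""
    d = len(site)
    return [
        tuple(s + o for s, o in zip(site, offset))
        for offset in product(range(-ell, ell + 1), repeat=d)
        if any(offset)  # skip the zero offset, i.e. the centre itself
    ]


def context_radius(site, support):
    """Smallest ell such that the ball of radius ell around `site` contains `support`."""
    return max(
        (max_norm([p - s for p, s in zip(point, site)]) for point in support),
        default=0,
    )


print(len(punctured_ball((0, 0), 1)))
print(context_radius((0, 0), {(1, 0), (-2, 1)}))
```

Because the balls are taken in the maximum norm, they are cubes of side $2\ell + 1$, which is what makes pattern matching on rectangular arrays convenient.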

Recall that $\omega \in \Omega = \mathcal{A}^{\mathbb{Z}^d}$ stands for a generic configuration of the field. In order to
distinguish between generic configurations and observed data, we will denote the observed
data by $\sigma$. Our statistical inference is based on observations of the Variable-neighborhood
random field $\mu$ over an increasing and absorbing sequence of finite regions $\Lambda_n \subset \mathbb{Z}^d$,
i.e. $\Lambda_n \subset \Lambda_{n+1} \subset \mathbb{Z}^d$ for all $n$, and for all $\Lambda' \Subset \mathbb{Z}^d$, there exists $n$ such that $\Lambda' \subset \Lambda_n$.

Hence, at step $n$, the sample is $\sigma_{\Lambda_n}$, where $\sigma_{\Lambda_n}$ is the fixed realization of $\mu$ in restriction
to $\Lambda_n$. We will construct our estimators based on sites within some security region $\bar\Lambda_n \subset
\Lambda_n$, where

$$\bar\Lambda_n = \{i \in \Lambda_n : V_i(k(n)) \subset \Lambda_n\} \tag{3.3}$$

with 

Mn) = (log|A„|)^. (3.4) 

In order to estimate $l_i(\sigma)$, we have to compare the neighborhood configuration of site
$i$ with the neighborhood configurations of different sites $j$ for all $j \in \bar\Lambda_n$. To do so we
define for any fixed $i \in \bar\Lambda_n$ and any $1 \le \ell \le k(n)$,

$$X_i^\ell(\omega) = \{X_i^\ell(\omega)(j) = \omega(i + j),\ j : 0 < \|j\| \le \ell\}, \tag{3.5}$$

hence $X_i^\ell(\omega)$ is the configuration around $i$ in a box of edge $\ell$. In terms of the shift operator,
$X_i^\ell(\omega) = (\omega \circ \tau_i)_{V_0^0(\ell)}$, i.e. this is the restriction of $\tau_i \omega$ to $V_0^0(\ell)$. We stress that $X_i^\ell$ does
not depend on the symbol at the center $i$ of the observation window, and this is important
for our purposes. We shall use the short-hand notation

$$X_i^\ell(\omega) = \omega_i^\ell.$$

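For any $1 \le \ell \le k(n)$, for any fixed configuration $\eta \in \mathcal{A}^{V_0^0(\ell)}$, let

$$N_n(\eta) = \sum_{j \in \bar\Lambda_n} 1_{\{X_j^\ell = \eta\}} \tag{3.6}$$

be the total number of occurrences of $\eta$ within $\bar\Lambda_n$. Moreover, for any fixed value $a \in \mathcal{A}$,
we write

$$N_n(\eta, a) = \sum_{j \in \bar\Lambda_n} 1_{\{X_j^\ell = \eta,\ X_j = a\}}. \tag{3.7}$$

In particular for the observed data $\sigma$, $\sigma_i^\ell$ is the data observed around the site $i$ in a ball of
radius $\ell$, and $N_n(\sigma_i^\ell)$ is the total number of occurrences of the local pattern around $i$ within
$\bar\Lambda_n$. By construction $N_n(\sigma_i^\ell) \ge 1$. Note that $N_n(\sigma_i^\ell, a)$ could be zero.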
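The counts in (3.6) and (3.7) can be gathered in one pass over the security region. The following is a minimal sketch for $d = 2$, assuming the sample is a rectangular list of lists and the security margin `k` is supplied by the caller; the function names and the checkerboard sample are illustrative, not taken from the paper.

```python
from collections import Counter
from itertools import product


def punctured_pattern(field, site, ell):
    """Symbols around `site` within max-norm distance ell, centre excluded, as a tuple."""
    r, c = site
    return tuple(
        field[r + dr][c + dc]
        for dr, dc in product(range(-ell, ell + 1), repeat=2)
        if (dr, dc) != (0, 0)
    )


def pattern_counts(field, ell, k):
    """Count patterns of radius ell over all sites at distance >= k from the border."""
    assert 1 <= ell <= k
    rows, cols = len(field), len(field[0])
    n_eta = Counter()    # occurrences of each punctured pattern
    n_eta_a = Counter()  # occurrences of each (pattern, centre symbol) pair
    for r in range(k, rows - k):
        for c in range(k, cols - k):
            eta = punctured_pattern(field, (r, c), ell)
            n_eta[eta] += 1
            n_eta_a[eta, field[r][c]] += 1
    return n_eta, n_eta_a


# Toy sample: a 5 x 5 checkerboard over the alphabet {0, 1}.
field = [[(r + c) % 2 for c in range(5)] for r in range(5)]
n_eta, n_eta_a = pattern_counts(field, ell=1, k=1)
print(sorted(n_eta.values()))
```

On the checkerboard only two patterns occur, and each of them determines the centre symbol, so every pair count is either zero or equal to the pattern count.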

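Let $\gamma : \mathcal{A} \times \mathcal{A}^{V_0^0(\ell)} \to [0, 1]$. We interpret $\gamma$ as a possible one-point specification of a
hypothetical Markov random field for which the corresponding context is contained in
$V_i(\ell)$. For any site $i$, under the hypothesis that its context is contained in $V_i(\ell)$, we define
the pseudo-likelihood of $\gamma$ as follows:

$$\mathrm{PL}_n^{(i,\ell)}(\gamma) = \prod_{j \in \bar\Lambda_n :\ \sigma_j^\ell = \sigma_i^\ell} \gamma(\sigma(j) \mid \sigma_j^\ell) = \prod_{a \in \mathcal{A}} \gamma(a \mid \sigma_i^\ell)^{N_n(\sigma_i^\ell, a)}, \tag{3.8}$$

where we restrict the product to all sites $j \in \bar\Lambda_n$ in order to be sure that $V_j(\ell)$ is still
contained inside the observation window $\Lambda_n$. Maximizing (3.8) with respect to $\gamma$ under
the constraint

$$\sum_{a \in \mathcal{A}} \gamma(a \mid \sigma_i^\ell) = 1$$

gives the following estimator of the one-point specification:

$$p_n(a \mid \sigma_i^\ell) = \frac{N_n(\sigma_i^\ell, a)}{N_n(\sigma_i^\ell)}.$$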

Analogously, we can define for any fixed configuration rj G 



Pn{a\rj) 



^ ifiVn(r?)>0, 
otherwise. 



(3.9) 



(3.10) 



Remark 3.1 Not all $\gamma$ satisfying $\sum_{a \in \mathcal{A}} \gamma(a \mid \sigma_i^\ell) = 1$ are possible one-point specifications;
one-point specifications have to satisfy additional conditions, which are collected in the
appendix, see (9.2), and which are not considered here. However, we define the
pseudo-likelihood also for $\gamma$ not satisfying these additional conditions.

Thus, given the sample $\sigma_{\Lambda_n}$, the logarithm of the maximum pseudo-likelihood of $\gamma$ is
the following quantity:

$$\log \mathrm{MPL}_n(i, \ell) = \sum_{a \in \mathcal{A}} N_n(\sigma_i^\ell, a) \log p_n(a \mid \sigma_i^\ell). \tag{3.11}$$
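Given the symbol counts observed next to one fixed pattern, the empirical one-point probabilities and the quantity in (3.11) take a few lines. This is a sketch under assumptions: `symbol_counts` is a hypothetical dictionary mapping each symbol $a$ to its count $N_n(\sigma_i^\ell, a)$, with at least one positive count, and terms with zero count are dropped (the usual convention $0 \log 0 = 0$).

```python
import math


def empirical_probabilities(symbol_counts):
    """Relative frequencies of the centre symbol given one fixed pattern."""
    total = sum(symbol_counts.values())
    return {a: n / total for a, n in symbol_counts.items()}


def log_max_pseudo_likelihood(symbol_counts):
    """Sum over symbols of count * log(relative frequency); zero counts contribute 0."""
    total = sum(symbol_counts.values())
    return sum(n * math.log(n / total) for n in symbol_counts.values() if n > 0)


counts = {0: 3, 1: 1}
print(empirical_probabilities(counts))
print(log_max_pseudo_likelihood(counts))
```

The value is always non-positive and equals zero exactly when the pattern determines the centre symbol in the sample.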



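The decision whether, for a given $i$, the context has radius $\ell - 1$ rather than $\ell$ is based on the
Kullback-Leibler information. We introduce

$$\log L_n(i, \ell) = \sum_{v} N_n(\sigma_i^{\ell-1} v)\, D\big(p_n(\cdot \mid \sigma_i^{\ell-1} v),\ p_n(\cdot \mid \sigma_i^{\ell-1})\big), \tag{3.12}$$

where we sum over all possibilities $v$ of extending $\sigma_i^{\ell-1}$ to a configuration $\sigma_i^{\ell-1} v$ of radius $\ell$
and where

$$D\big(p_n(\cdot \mid \sigma_i^{\ell-1} v),\ p_n(\cdot \mid \sigma_i^{\ell-1})\big) = \sum_{a \in \mathcal{A}} p_n(a \mid \sigma_i^{\ell-1} v) \log \frac{p_n(a \mid \sigma_i^{\ell-1} v)}{p_n(a \mid \sigma_i^{\ell-1})}$$

is the Kullback-Leibler information. Note that $\log L_n(i, \ell)$ is a function of $\sigma_i^{\ell-1}$, but not
of $\sigma_i^\ell$. We rewrite it as follows:

$$\log L_n(i, \ell) = \sum_{j \in \bar\Lambda_n :\ X_j^{\ell-1} = \sigma_i^{\ell-1}} \frac{1}{N_n(X_j^\ell)} \sum_{a \in \mathcal{A}} N_n(X_j^\ell, a) \log \frac{p_n(a \mid X_j^\ell)}{p_n(a \mid X_j^{\ell-1})}. \tag{3.13}$$

Finally note that

$$\log L_n(i, \ell) = \sum_{j \in \bar\Lambda_n :\ X_j^{\ell-1} = \sigma_i^{\ell-1}} \frac{1}{N_n(X_j^\ell)} \log \mathrm{MPL}_n(j, \ell) - \log \mathrm{MPL}_n(i, \ell - 1). \tag{3.14}$$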
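The statistic in (3.12) can be computed from the symbol counts of each observed extension of the inner pattern. The sketch below is illustrative: `extension_counts` is a hypothetical dictionary mapping each extension $v$ that occurs in the sample to a `Counter` of centre symbols, and the counts for the inner pattern are obtained by pooling over extensions, which assumes that both radii are counted over the same set of sites.

```python
import math
from collections import Counter


def log_mpl(counts):
    """Log maximum pseudo-likelihood from symbol counts (zero counts contribute 0)."""
    total = sum(counts.values())
    return sum(n * math.log(n / total) for n in counts.values() if n > 0)


def log_likelihood_ratio(extension_counts):
    """Sum over extensions v of N(v) * KL(p(.|inner pattern + v), p(.|inner pattern))."""
    pooled = Counter()
    for counts in extension_counts.values():
        pooled.update(counts)  # counts for the inner pattern alone
    pooled_total = sum(pooled.values())
    stat = 0.0
    for counts in extension_counts.values():
        total = sum(counts.values())
        for a, n in counts.items():
            if n > 0:
                # N(v) * p(a|v) equals the raw count n, so each term is n * log ratio.
                stat += n * math.log((n / total) / (pooled[a] / pooled_total))
    return stat


ext = {"v1": Counter({0: 8, 1: 2}), "v2": Counter({0: 3, 1: 7})}
pooled = ext["v1"] + ext["v2"]
stat = log_likelihood_ratio(ext)
difference = sum(log_mpl(c) for c in ext.values()) - log_mpl(pooled)
print(stat, difference)
```

The two printed numbers agree up to rounding, which is the content of the identity (3.14) once the sum over sites is grouped by distinct extensions. When all extensions have the same empirical law the statistic vanishes.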
Now we start from $\ell = R_n$ and proceed successively from $\ell$ to $\ell - 1$. The log likelihood ratio
statistic $\log L_n(i, \ell)$ will be essentially equal to zero for all $\ell > l_i(\sigma)$. The first range at which
$\log L_n(i, \ell)$ is significantly different from zero is a range such that $p_n(\cdot \mid \sigma_i^{\ell-1}) \ne p_n(\cdot \mid \sigma_i^\ell)$, in
which case it is reasonable to suppose that $\ell = l_i(\sigma)$.

Before formalising this intuition in the definition of the estimator, for technical reasons
we have to introduce the following security diameter

Rn = [{log \An\)^ , (3.15) 

where $[\cdot]$ denotes the integer part of a number. Note however that

$$R_n / k(n) \to 1 \quad \text{as } n \to \infty,$$

where $k(n)$ was defined in (3.4). We are now able to define the estimator of the context
length function.

Definition 3.2 The estimator

Given the observation $\sigma_{\Lambda_n}$, for any $i \in \bar\Lambda_n$, see (3.3), the estimator of $l_i(\sigma)$, defined in
(3.2), is the following random variable

$$\hat l_n(i) = \hat l_n(i, \sigma) = \min\{\ell = 1, \dots, R_n - 1 : \forall\, k > \ell,\ \log L_n(i, k) < \mathrm{pen}(k, n)\}, \tag{3.16}$$

whenever $\{\ell = 1, \dots, R_n - 1 : \forall\, k > \ell,\ \log L_n(i, k) < \mathrm{pen}(k, n)\} \ne \emptyset$. Otherwise we set
$\hat l_n(i) = R_n$. In the above definition,

$$\mathrm{pen}(\ell, n) = \kappa\, |\mathcal{A}|\, |\mathcal{A}|^{|\partial V_0(\ell)|} \log |\Lambda_n|, \tag{3.17}$$

and $\kappa$ is a positive constant that can be chosen freely, provided it is at least of the order
given in (3.18).
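Once the statistics and the penalties are available, the selection rule of Definition 3.2 is a short scan over candidate radii. The following is a sketch, not the authors' implementation: `log_l` and `pen` are hypothetical dictionaries indexed by $k = 1, \dots, R_n$ holding $\log L_n(i,k)$ and $\mathrm{pen}(k,n)$, and the condition "for all $k > \ell$" is read as ranging over $k = \ell + 1, \dots, R_n$.

```python
def estimate_radius(log_l, pen, r_n):
    """Smallest ell in 1..r_n-1 such that every larger k up to r_n passes the penalty test."""
    for ell in range(1, r_n):
        if all(log_l[k] < pen[k] for k in range(ell + 1, r_n + 1)):
            return ell
    return r_n  # fallback when no candidate radius qualifies


# Toy values: the statistic is large up to radius 3 and negligible beyond.
log_l = {1: 9.0, 2: 8.0, 3: 7.0, 4: 0.1, 5: 0.2}
pen = {k: 1.0 for k in range(1, 6)}
print(estimate_radius(log_l, pen, r_n=5))
```

In the toy data the scan stops at the largest radius whose statistic still exceeds the penalty, which matches the top-down reading of the procedure (start from $R_n$ and decrease until the first significant statistic).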

In other words, the above estimator chooses the minimal length $\ell$ such that all sites which
are relevant to determine the value of the symbol at site $i$ belong to a ball of radius $\ell$.
Once we have estimated the context length function, the underlying context $c_i(\sigma)$ is then
estimated by

$$\hat c_{n,i}(\sigma) = \sigma_i^{\hat l_n(i)}$$

and the corresponding one-point specification by $\hat\gamma_{n,i}(a \mid \sigma) = p_n(a \mid \hat c_{n,i}(\sigma))$.



Remark 3.3 1. In one dimension the above penalization term is independent of $\ell$,
since in this case $|\mathcal{A}|^{|\partial V_0(\ell)|} = |\mathcal{A}|^2$. This leads to a penalty term

$$\mathrm{pen}(n) = \kappa\, |\mathcal{A}|^3 \log |\Lambda_n|.$$

2. Once the statistician has determined the radius of the context $\ell = l_i(\sigma)$ by means of
the estimator $\hat l_n(i)$, it is possible, in a second step, to determine the geometry of the
context, i.e. to estimate $c_i(\sigma)$ itself. This can be done by adapting the estimator of
Csiszar and Talata (2006), [4], to our setup, where the penalty can be restricted to
all shapes contained in $V_i(\ell)$ for $\ell = \hat l_n(i)$.
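The first item of the remark can be checked numerically by counting the sites of the sphere $\partial V_0(\ell)$: in the maximum norm its size is $(2\ell+1)^d - (2\ell-1)^d$, which equals 2 for every $\ell$ when $d = 1$. The sketch below compares this closed form with a brute-force enumeration; the function names are illustrative, and the link to the penalty assumes that the penalty depends on $\ell$ only through $|\partial V_0(\ell)|$.

```python
from itertools import product


def boundary_size(ell, d):
    """Number of sites at max-norm distance exactly ell from the origin (ell >= 1)."""
    return (2 * ell + 1) ** d - (2 * ell - 1) ** d


def boundary_size_by_enumeration(ell, d):
    """Same quantity, obtained by listing the cube and keeping the outer shell."""
    return sum(
        1
        for j in product(range(-ell, ell + 1), repeat=d)
        if max(abs(x) for x in j) == ell
    )


for ell in (1, 2, 3):
    print(ell, boundary_size(ell, 1), boundary_size(ell, 2))
```

In dimension two the shell has $8\ell$ sites, so any penalty that is exponential in the shell size grows very quickly with the candidate radius.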
3.1 Main results

The estimator depends on the penalization term (3.17), and therefore on the choice of

I 3e 



the constant k. Choose S > 2*^102; 1^1 .^'^ and define 



1/2 



K = k{5) = b'^ [^j 6. (3.18) 
For the estimator defined in this way the following theorems are our main results. 

Assumption 3.4 The local specification is positive. We define

$$q_{\min} = \inf_{a \in \mathcal{A}} \inf_{\omega \in \Omega} \gamma_0(a \mid \omega) > 0. \tag{3.19}$$



Theorem 3.5 (Overestimation) Let $\mu$ be a translation covariant Variable-neighborhood
random field for which Assumption 3.4 holds. For any $\varepsilon > 0$ there
exist $n_0 = n_0(\varepsilon, \delta, q_{\min})$ and $c(\delta) = c(\delta, q_{\min})$, so that for any $n \ge n_0$ the probability of
overestimation is bounded by

< 



BieAn-. ln{i) > k{<y) 

C(d)(log lAnl)"^ • exp {-c{5)^\og\K\^ + C{d) exp (-|A„|^-^) , (3.20) 

where $C(d)$ is a positive constant depending only on the dimension and where $\bar\Lambda_n$ is given
in (3.3).

Remark 3.6 To obtain an upper bound in (3.20) summable in $n$, we need a fast increase
of the sampling regions, of order for example



log|A„|~(logni+^)2, 
which requires faster increase than choosing A„ = [— f , f]*^- 

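For bounding the probability of underestimation we need an additional assumption. To this aim define

$$r(i,j) = \sup_{\omega, \omega' :\ \omega(k) = \omega'(k)\ \forall k \neq j} \| \gamma_i(\cdot|\omega) - \gamma_i(\cdot|\omega') \|_{TV},$$

where $\|\cdot\|_{TV}$ denotes the total variation norm. By translation covariance $r(i,j) = r(0, i-j)$. We denote

$$\beta(\ell) = \sum_{k \in \mathbb{Z}^d : \|k\| > \ell} r(0,k). \qquad (3.21)$$

Assumption 3.7 We assume that there exists $L > 0$ such that

$$r(0,i) = 0 \quad \text{for all } \|i\| > L \qquad (3.22)$$

and

$$\sum_{i \in \mathbb{Z}^d \setminus \{0\}} r(0,i) < 1. \qquad (3.23)$$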
Remark 3.8 Condition (3.22) implies that the observed random field is actually a Markov random field of order $L$. The order $L$, however, is unknown. We do not propose to estimate this unknown order $L$. When passing to the parsimonious description (2.6), what we actually propose is to estimate, for every site $i$, given the observation $\sigma$, the minimal order $\ell_i(\sigma)$ that we need in order to determine the specification at that site, given $\sigma$. This is also called Minimum Description Length in the literature. However, if $\ell_i(\sigma)$ does not depend on the configuration, then our estimator naturally provides an estimator of $L$.

Condition (3.23) is the Dobrushin condition, which implies uniqueness of the measure $\mu$, see Dobrushin (1968).

For Example 2.10 and Example 2.11, thanks to the summability assumptions (2.11) and (2.15), there exists for both of them a critical value of the temperature $\beta_c$ such that (3.23) is satisfied for all $\beta < \beta_c$. For Example 2.9, condition (3.23) is verified. This can be shown as in the paper by Ferrari and Wyner (2003), [14].

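Theorem 3.9 (Underestimation) Let $\mu$ be a translation covariant Variable-neighborhood random field for which Assumption 3.4 and Assumption 3.7 hold. Then for any $\varepsilon > 0$ there exists $n_0 = n_0(\varepsilon, \delta, q_{\min}, L)$, so that for any $n > n_0$ the probability of underestimation is bounded by

$$\mu\big[\exists\, i \in \Lambda_n : \hat\ell_n(i) < \ell_i(\sigma)\big] \le \exp\big(-|\Lambda_n|^{1-\varepsilon}\big). \qquad (3.24)$$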

Remark 3.10 1. The above results are stated for all $n > n_0$, where $n_0$ depends on the (unknown) model parameter $q_{\min}$ and on the interaction through $L$. It is possible to write down upper bounds which hold for all $n$, but then the bounds become more complicated and depend on $q_{\min}$ and on $L$. We adopted the above way of writing in order to state the results in a more transparent way.

2. Note that the trade-off between the rates of the two kinds of errors (exponential convergence for the probability of underestimation in (3.24) and (basically) polynomial convergence of the probability of overestimation in (3.20)) is a typical feature in problems of order estimation, appearing already in the simpler problem of order estimation for Markov chains, see e.g. the papers by Finesso et al. (1996), [15], and Merhav et al. (1989), [25].

This represents the usual trade-off between type one and type two errors in statistical decision problems: overestimation means that the estimate exceeds the true order and that we choose models that include the true data-generating mechanism. This choice is not optimal but only leads to a higher cost. On the other hand, underestimation leads to a restriction to lower order models that do not describe the observed data.

So it is desirable to have an exponential control on the probability of underestimation while keeping some polynomial control on the probability of overestimation.

3. The definition of our estimator depends on the parameter $\delta$. This plays an important role only for the overestimation. Namely, it appears in the exponent of the upper bound through the constant

$$c(\delta) = \frac{4\, q_{\min}\, \delta}{3e} - 2^d \log|A|$$

(see the end of the proof of Lemma 5.2). To ensure the consistency of the estimator we need to choose $\delta$ sufficiently large, depending on the one-point specification and on $q_{\min}$, such that $c(\delta) > 0$. Therefore, our estimator is not universal, in the sense that for fixed $\delta$ it fails to be consistent for any random field such that $c(\delta) < 0$.

This problem appears even in the simpler case of order estimation for Markov chains, see for example Finesso et al. (1996), [15], and Merhav et al. (1989), [25]. As pointed out by Finesso et al. (1996), [15], it is not possible to have an exponential bound on the overestimation probability of an order estimator without rendering it inconsistent, for at least one model, for the underestimation.
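To make the role of $\delta$ concrete, the following sketch tabulates the threshold for $\delta$, the penalty constant $\kappa(\delta)$ and the exponent constant $c(\delta)$. The closed forms used here ($\delta > 2^d \log|A| \cdot 3e/(4 q_{\min})$, $\kappa(\delta) = 5^d (3/2)^{1/2} \delta$ and $c(\delta) = 4 q_{\min}\delta/(3e) - 2^d \log|A|$) are assumptions of this sketch, as are the function names and the numerical parameters; it is an illustration, not part of the proofs.

```python
import math

def delta_threshold(d, alphabet_size, q_min):
    """Assumed lower bound for delta: 2^d * log|A| * 3e / (4 q_min)."""
    return (2 ** d) * math.log(alphabet_size) * 3.0 * math.e / (4.0 * q_min)

def kappa(delta, d):
    """Assumed penalty constant: 5^d * sqrt(3/2) * delta."""
    return (5 ** d) * math.sqrt(1.5) * delta

def c_const(delta, d, alphabet_size, q_min):
    """Assumed exponent constant: 4 q_min delta / (3e) - 2^d log|A|."""
    return 4.0 * q_min * delta / (3.0 * math.e) - (2 ** d) * math.log(alphabet_size)

if __name__ == "__main__":
    d, size, q_min = 2, 2, 0.1  # hypothetical model parameters
    thr = delta_threshold(d, size, q_min)
    # c(delta) changes sign exactly at the threshold for delta.
    for delta in (0.5 * thr, 2.0 * thr):
        print(delta, kappa(delta, d), c_const(delta, d, size, q_min))
```

Under these assumed forms, $c(\delta) > 0$ holds exactly when $\delta$ exceeds the threshold, which is the consistency requirement discussed in the remark.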

Remark 3.11 Let us finally compare our results in the case of dimension one to the results of Ferrari and Wyner (2003), [14]. They consider stationary chains taking values in a finite alphabet without imposing any a priori bound on the memory. Hence, they are dealing with infinite trees. They overcome this difficulty by approximating the possibly infinite memory chain by a sequence of finite range Markov chains of growing order. The price to pay in order to deal with these general processes is to impose geometric $\alpha$-mixing conditions both for the control of the over- and the underestimation.

In comparison to their results, to control the underestimation, we need a slightly stronger assumption. We require geometric $\phi$-mixing, which implies geometric $\alpha$-mixing. This is crucial to obtain Theorem 3.9. We use Condition (3.22) as a sufficient condition to obtain the geometric $\phi$-mixing.

Condition (3.22), which implies that the random field is of finite range, could probably be relaxed. It should be possible to deal with infinite range models, provided one finds other sufficient conditions implying mixing.

Note however, that mixing automatically implies the uniqueness of the underlying measure $\mu$. Hence, using this kind of technique always implies that the Dobrushin uniqueness condition (3.23) must be satisfied. There is some hope to deal with the underestimation even in the case of phase transition, see Remark 4.3.

Concerning the control of the overestimation, we are able to deal with the general long range case without requiring mixing. Hence we can do better than Ferrari and Wyner (2003), [14], in this aspect. We need to impose the positivity condition on the specification, see Assumption 3.4. Ferrari and Wyner only impose positivity within each step of the canonical Markov approximation, allowing these lower bounds to tend to zero at a certain rate.

4 Deviation inequalities for underestimation 

The deviation inequalities needed for the underestimation are based on results obtained by Dedecker (2001), [6], on exponential inequalities for random fields. To adapt these results to our model we need Assumption 3.4 and Assumption 3.7. In the next subsection we present a preliminary deviation inequality based on the results of Dedecker (2001), [6], and then give in the following subsection the deviation inequalities needed to control the probability of underestimation.

4.1 Preliminaries 

Fix $\ell > 0$. For a given configuration $\eta \in A^{V_0^0(\ell)}$ we define

$$p(\eta) = \mu(\{X_i = \eta_i,\ \forall i \in V_0^0(\ell)\}). \qquad (4.1)$$

Recall that $N_n(\eta)$ is the total number of occurrences of $\eta$ in the observation $\sigma_{\Lambda_n}$. Then we get the following result, which is an immediate consequence of Corollary 4 of Dedecker (2001), [6].
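As an illustration of the counting quantities, here is a minimal sketch in dimension one. It assumes that the punctured ball is $V_i^0(\ell) = \{i-\ell, \dots, i+\ell\} \setminus \{i\}$, that only centers whose ball fits inside the sample are counted, and that the empirical conditional probability is the ratio $N_n(\eta, a)/N_n(\eta)$; the function names are hypothetical.

```python
from collections import Counter

def neighborhood_counts(x, ell):
    """Count punctured-neighborhood patterns of radius ell in a 1D sample x.

    Returns (N(eta), N(eta, a)) as Counters, where eta is the tuple of the
    2*ell symbols around a center and a is the symbol at the center.
    """
    n_eta = Counter()
    n_eta_a = Counter()
    for i in range(ell, len(x) - ell):
        eta = tuple(x[i - ell:i]) + tuple(x[i + 1:i + ell + 1])
        n_eta[eta] += 1
        n_eta_a[(eta, x[i])] += 1
    return n_eta, n_eta_a

def empirical_conditional(n_eta, n_eta_a, a, eta):
    """Empirical probability of symbol a given pattern eta (0.0 if eta is unseen)."""
    return n_eta_a[(eta, a)] / n_eta[eta] if n_eta[eta] else 0.0

if __name__ == "__main__":
    sample = [0, 1, 0, 1, 0, 1, 0]
    n_eta, n_eta_a = neighborhood_counts(sample, 1)
    print(n_eta, empirical_conditional(n_eta, n_eta_a, 1, (0, 0)))
```

In higher dimension the same bookkeeping applies with patterns indexed by the sites of the punctured ball instead of a tuple of left and right neighbors.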



Proposition 4.1 Under Assumption 3.4 and Assumption 3.7 there exists a constant $c(d, L)$ depending only on the dimension and on the range $L$, such that for any configuration $\eta \in A^{V_0^0(\ell)}$,

-"(')-p(,,|,.),e-e,p(-£™i^). 



I An 



The remainder of this section is devoted to showing how this result can be obtained as a consequence of Corollary 4 of Dedecker (2001), [6]. We give this proof in detail since it shows to what extent Assumption 3.7 is needed.



Proof. For any i, let 

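Then under $\mu$, $\{Y_i : i \in \mathbb{Z}^d\}$ is a stationary random field. The associated filtration is defined as follows. For any $\Gamma \subset \mathbb{Z}^d$, let

$$\mathcal{G}^\ell_\Gamma = \sigma\{Y_i, i \in \Gamma\},$$

and define the $\Phi$-mixing coefficient

$$\Phi(\mathcal{G}^\ell_{\Gamma_1}, \mathcal{G}^\ell_{\Gamma_2}) = \sup\{\|\mu(B|\mathcal{G}^\ell_{\Gamma_1}) - \mu(B)\|_\infty,\ B \in \mathcal{G}^\ell_{\Gamma_2}\}.$$

Moreover, let

$$\phi^\ell_{\infty,1}(n) = \sup\{\Phi(\mathcal{G}^\ell_{\Gamma_1}, \mathcal{G}^\ell_{\Gamma_2}) : |\Gamma_2| = 1,\ \mathrm{dist}(\Gamma_1, \Gamma_2) \ge n\}, \qquad (4.3)$$

where $\mathrm{dist}(\Gamma_1, \Gamma_2) = \min\{\|j - i\|, i \in \Gamma_1, j \in \Gamma_2\}$. Let

$$\beta_\ell = 1 + \sum_{n \ge 1} \phi^\ell_{\infty,1}(n)\, |\partial V_0(n)|. \qquad (4.4)$$

The quantity $\beta_\ell$ depends on $\ell$ through the filtration $\{\mathcal{G}^\ell_\Gamma, \Gamma \subset \mathbb{Z}^d\}$. To avoid confusion we warn the reader that $\beta_\ell$ defined in (4.4) is a different quantity from $\beta(\ell)$ defined in (3.21), although related. Corollary 4 of Dedecker (2001), [6], implies the following exponential inequality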



Nn{ri) 



p{ri) >5J <ei/^exp(^-^|-J. (4.5) 

We estimate in Lemma [4.2[ stated below. Then, defining c{d,L) = j(j^fj^, where 
C {d, L) is the constant of Lemma 14. 2| we obtain (j4.2p . • 



A„|(5- 



|A„ 



Assumption 3.7 is essential for proving the following lemma.



Lemma 4.2 Under Assumption 3.7, there exist constants $c^* = c^*(L)$ and $k = k(L)$ such that

$$\phi^\ell_{\infty,1}(n + 2\ell) \le c^* |V_0^0(\ell)|\, e^{-kn},$$

and for $\beta_\ell$ defined in (4.4), we have

$$\beta_\ell \le C(d,L)\,(2\ell)^{2d-1}, \qquad (4.6)$$

where $C(d,L)$ is a positive constant depending on the dimension $d$ and on $L$.
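Proof. For any $\Gamma \subset \mathbb{Z}^d$, let

$$\Gamma(\ell) = \{i \in \mathbb{Z}^d : d(i, \Gamma) \le \ell\}.$$

We have that, whenever $|\Gamma| > 1$,

$$\mathcal{G}^\ell_\Gamma = \sigma\{Y_i, i \in \Gamma\} \subset \sigma\{X_i, i \in \Gamma(\ell)\} = \mathcal{F}_{\Gamma(\ell)}.$$

When $|\Gamma| = 1$, assuming $\Gamma = \{i\}$,

$$\mathcal{G}^\ell_\Gamma = \sigma\{Y_i\} \subset \sigma\{X_j, j \in \Gamma(\ell) \setminus \{i\}\} = \sigma\{X_j, j \in V_i^0(\ell)\}.$$

By translation covariance, it is sufficient to set $\Gamma_2 = \{0\}$, $|\Gamma_1| = \infty$ and $\mathrm{dist}(\Gamma_1, \Gamma_2) \ge n + 2\ell$, for $n \ge 1$. Now, take $B = \{X_{V_0^0(\ell)} = \eta\}$ for some fixed $\eta \in A^{V_0^0(\ell)}$. Since $\mathcal{G}^\ell_{\Gamma_1} \subset \mathcal{F}_{V_0(n+\ell)^c}$ and $\mu(B|\mathcal{G}^\ell_{\Gamma_1}) = \mu(\mu(B|\mathcal{F}_{V_0(n+\ell)^c})|\mathcal{G}^\ell_{\Gamma_1})$, in order to bound $\phi^\ell_{\infty,1}(n + 2\ell)$, it is sufficient to bound

$$\|\mu(B|\mathcal{G}^\ell_{\Gamma_1}) - \mu(B)\|_\infty \le \|\mu(B|\mathcal{F}_{V_0(n+\ell)^c}) - \mu(B)\|_\infty.$$

But

$$\mu(B) = \mu(\mu(B|\mathcal{F}_{V_0(n+\ell)^c})).$$

Hence, using the specification $\gamma$ defined in (2.1) and (2.2), by definition (4.3) we have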

< sup jy [\lVo{n+t){B\u:) - lVo{n+t){BW)\] 

< sup [|7v/o(„+^)(B|w) - 7i4,(„+^)(B|w')l] 



(4.7) 



< sup |7yo(n+£)(w(^0°W)k) - lV,(n+iMVSm^')\- 

,^{V§(l)),^(Vo(n+lY),w'(VQ{n+lY) 

To control this last term Assumption 3.7 is essential. Indeed, we need to show that, uniformly in the boundary conditions outside $V_0(n + \ell)$, (4.7) is exponentially small in $n$. Applying Theorem 3.1.3.2 of Presutti (2009), [26], we obtain the following. There exists a function $u_{V_0(n+\ell)}$ with values in $\mathbb{R}_+$ such that

sup \^vo{n+iMySm^) - ivo(n+iMvsm^')\ < 

uj(V§{e)),w{VQ{n+tY),i^'(Vo{n+llY) 

^ UVo{n+i){i)- (4.8) 

Moreover, by Corollary 3.2.5.5 of Presutti (2009), [26], under (3.22), there exist $c^* = c^*(L)$ and $k = k(L)$ so that

«yo(n+£)(0 < c*e-'='^(*'^»(«+^)^),i G yo°W- (4.9) 



Therefore, we have 
and thus 



$^^i(n + 2£) < c*e~'="|yc° 



V^^h 



< |yo°(2^)l + E |5^o(n)Ki(n) 

. . (4.10) 

< (4£)'^ + c*(4^)'^ E n'^-^e^'^^"-^^). 

n>2£+l 



Immediately one gets ()4.6p . 
Remark 4.3 In Proposition 4.1 we obtain an exponential rate of convergence in the ergodic theorem. It is very likely that for our purposes polynomial or sub-exponential rates of convergence will be enough. This would allow us to control the probability of underestimation also in the regime of phase transition. This lies, however, outside the scope of the present paper.



4.2 Deviation inequalities 

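We are now able to state the deviation inequalities needed to control the probability of underestimation. They are consequences of Proposition 4.1 and follow ideas of Galves and Leonardi (2008), [17]. Before doing so, we define for any $a \in A$, $\eta \in A^{V_0^0(\ell)}$,

$$p(a|\eta) = \frac{p((a,\eta))}{p(\eta)} = \frac{\mu(\{X_0 = a,\ X_i = \eta_i,\ \forall i \in V_0^0(\ell)\})}{\mu(\{X_i = \eta_i,\ \forall i \in V_0^0(\ell)\})}. \qquad (4.11)$$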



By Assumption 13.41 we have that for any given configuration r/ G A^oi^)^ 
and 

Qmin • 

We are interested in configurations having support in a ball of radius at most L. Hence, 
writing 

ao = inf inf {p{a\v) , p{v)} , (4-12) 



we obtain that 



We define the following quantity 



«0>gS'. (4.13) 



$$\Delta_n(\eta) = \sum_{a \in A} \left[ \frac{N_n(\eta,a)}{|\Lambda_n|} \log \hat p_n(a|\eta) - p((\eta,a)) \log p(a|\eta) \right], \qquad (4.14)$$

where $\hat p_n(a|\eta)$ is the quantity defined in (3.10). We obtain the following deviation inequalities.



Corollary 4.4 Let $\mu$ be a translation covariant Variable-neighborhood random field for which Assumption 3.4 and Assumption 3.7 hold. Let $t > 0$, $\ell \le L$ where $L$ is given in (3.22), let $\eta \in A^{V_0^0(\ell)}$, let $\hat p_n(\cdot|\eta)$ be defined as in (3.10), $p(\cdot|\eta)$ as in (4.11), and let $\Delta_n(\eta)$ be as defined in (4.14). Then there exists a constant $C(d,L)$ depending only on the dimension and on $L$ such that



{\Pn{a\ri) - p{a\ri)\ > t) < le^l^exp \-C{d, L) 
^i (I A„(7?)| > t) < 3|^|ei/^ exp (-C{d, L) 



|A„|t^ao 



Va e A, 



4e 

\kn\{tM'^)al 
8|^|2(log2 ao Vl)e 



where $\alpha_0$ is given in (4.12) and estimated in (4.13).



(4.15) 
(4.16) 
Proof. Concerning (4.15) we obtain, by inserting and subtracting the term $\frac{N_n(\eta,a)}{|\Lambda_n|\, p(\eta)}$,



\Pnia\v) -Pia[ 



< 



Nn{rj,a) Nn{r],a) 



+ 



1 fNn{v,a) 



p{v) 



Piia,v)) 



The first term in the last expression can be upper bounded by 



Nn{r],a) Nn{r],a) 



Nn{r],a 



\An\pir]) - Nniv) 



Nn{v)\An\p{v) 



p{v) 



< 



p{v) 



As a consequence we obtain that 



fJ- {\Pnia\r]) - p{a\r])\ > t) 

Nniv) 



p{v) 



|A„| 



p{{r],a)) 



Then, applying (j4.2p . we get 



4{2i) 



4(2L) 



Hence, writing C{d, L) = c{d, L)/(2L)^'^ ^, where c{d, L) is the constant of (|4.2|) . assertion 
(1^5]) follows. 

To show (I4.16P we subtract and add the term ^"i'?'") \ogp(a\'q) to A„(ry). We obtain 



An(r/) = 



' log P'^^'^^^^ 



|An| 



p{a\ri) 



+ Y,(^^^'j^-piiri,a))]logpia\v) 



We rewrite A^(?7) in the following way and then apply the estimate (j6.4|) : 

Nniv) sr^^ r^^^\^^„Pnia\v) 



^iiv) 



\A„ 



^Pnialv) log 



pia\v) 



^ Nniv) iPnia\v) -Pia\v))^ 



|A„ 

aeA 



aeA 



pia\v) 



iPnia\v) -Pia\v))^ 
pia\v) 

iPnia\v) -Pia\v))^ 
Therefore



m(^|A1(7?)|>- 

< ^ IJ- ({Pn{a\ri) -p{a\ri)f > T^^ao 

< 2\A\e'^^^ exp (-C{d,L) 



\A\2 



by (j4.15|) . We get for the second term 



|An|^ Qq 

8|^|e 



(4.17) 



t 



2^ 



> 



1 t 1 



1^1 2 I log ao I 



4\A\nog^aoeJ ' 



(4.18) 



by (|4.2p . This finishes the proof. 



5 Deviation inequalities for overestimation 

In order to control the probability of overestimation we do not need assumptions as strong as those for the control of the probability of underestimation. Indeed, we can avoid imposing Assumption 3.7. We mimic the method used by Csiszár and Talata (2006), [4], see Proposition 3.1 and Lemma 3.3 of their paper. Their results are typicality results and they obtain the almost sure convergence of the empirical probabilities to the theoretical ones. We follow the way indicated by Csiszár and Talata (2006), [4], but we quantify the errors and obtain in this way precise deviation inequalities. We will need only Assumption 3.4.

We partition the region $\Lambda_n$ by intersecting it with a sub-lattice of $\mathbb{Z}^d$ such that the distance between sites in the sub-lattice is $4R_n + 1$. More precisely, let

$$\Lambda_n^k = \{j \in \Lambda_n : j = k + (4R_n + 1)\,l,\ l \in \mathbb{Z}^d\}, \qquad \|k\| \le 2R_n.$$
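The sub-lattice decomposition can be sketched as follows. The example assumes that the sampling region is the box $\{0, \dots, \text{side}-1\}^d$ and that $\|\cdot\|$ is the sup norm; the function name and the parameters are hypothetical. Each site is assigned to the class $k$ obtained by reducing its coordinates modulo $4R + 1$ into the window $[-2R, 2R]$.

```python
from itertools import product

def sublattice_classes(side, R, d=2):
    """Partition the box {0..side-1}^d into classes indexed by k in {-2R..2R}^d.

    Two sites in the same class differ by a multiple of 4R+1 in every
    coordinate, so distinct sites of a class are at sup-distance >= 4R+1.
    """
    period = 4 * R + 1
    classes = {}
    for j in product(range(side), repeat=d):
        # Representative of j modulo the period, shifted into [-2R, 2R].
        k = tuple(((c + 2 * R) % period) - 2 * R for c in j)
        classes.setdefault(k, []).append(j)
    return classes

if __name__ == "__main__":
    classes = sublattice_classes(side=10, R=1, d=2)
    print(len(classes), sum(len(v) for v in classes.values()))
```

Since sites within one class are far apart, balls of radius at most $R$ around them do not overlap, which is what makes the conditional independence arguments of this section possible.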

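For any $\ell \le R_n$ and any fixed configuration $\eta \in A^{V_0^0(\ell)}$, let $N_n^k(\eta)$ be the number of occurrences of $\eta$ in the sample having center in $\Lambda_n^k$. In the same way we denote by $N_n^k(\eta, a)$ the number of these occurrences having the symbol $a$ at the center. Note that we have

$$N_n(\eta) = \sum_{k : \|k\| \le 2R_n} N_n^k(\eta), \qquad N_n(\eta, a) = \sum_{k : \|k\| \le 2R_n} N_n^k(\eta, a).$$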

Let

$$A(n,\ell,k) = \big\{\tfrac{3}{2} \log N_n^k(\eta) > \log|\Lambda_n|, \text{ for all } \eta \in A^{V_0^0(\ell)} \text{ s.t. } \ell \ge \ell_0(\eta)\big\} \qquad (5.1)$$

and

$$B(n,\ell) = \bigcap_{k : \|k\| \le 2R_n} A(n,\ell,k). \qquad (5.2)$$

The probabilities $\mu(A(n,\ell,k))$ and $\mu(B(n,\ell))$ can be immediately obtained from Lemma 5.3 given at the end of this section. Recall the definition of $\hat p_n$ in (3.9).



Theorem 5.1 For any

$$\delta > 2^d \log|A|\, \frac{3e}{4 q_{\min}} \qquad (5.3)$$

there exist a positive constant $c(\delta) = c(\delta, q_{\min})$ and $n_0$ (not depending on $q_{\min}$ nor on $\delta$) such that for all $n > n_0$, for any $\ell \le R_n$,


Bin,' 



< 4 (4i?„ + 1)'=' exp -c(5)./log |A„| , 



(5.4) 



where $\kappa(\delta) > 0$ is as in (3.18).

The main ingredient to prove Theorem 5.1 is the following lemma.

Lemma 5.2 For any $\delta$ as in (5.3) there exist $n_0$ (not depending on $q_{\min}$ nor on $\delta$) and a positive constant $c(\delta) = c(\delta, q_{\min})$, such that for all $n > n_0$, for any $\ell \le R_n$,



3 7je A^o(^) J >lQ(r]) 



Ain,i, k) 



< 4exp -ci5)Jlog\An 



(5.5) 



Proof. Fix 7/ e ^^0°'^^) with £ > lo{ri) and set 7(a) = 7{o}(a|?7). Recall that 7(a) > q„ 
We first provide an upper bound for fixed rj of 



l^ 



iV^(r/,a)-iV^(ry)7(a) > ^6j{a)N}i{rj)(logNl;{rj)y/^ ,A{n,l,k) 



By definition 

N^i7],a)- N^{7]h{o}{a\r]) = ^ ^{x^=^} 

We order in some arbitrary way the points 



l{X,=a} -7{0}(«h) 



(5.6) 



{jeAtx' = 7j} = {ji,l<l<N^{rj)}. 



21 



Define 



Zi 



=4 -7{0}(a|r/) , / = 1, . . . iV^(??). 



The random variables {Z/, / = !,... Nll{r])} are identically distributed random variables 
with mean zero and conditionally independent, i.e. for i ^ j, < \zi\ < 1, < |zj| < 1 



Zi = Zi,Zj = Zj\uj{An\Uj^XkVj{e)) =H Zi = Zi\uj{An\Uj^XkVj{£)) 



Zj = zMAn\Uj^^kVj{£)) 



Take an independent copy {Z'-,i > 1} of i.i.d. random variables, having the same distri- 
bution as Zi, independent of X. Then for i > N^[r}) we let Zi = Z'^_j^^^_^y The important 
point of this definition is that in this way, the sequence of random variables Z^^Z^-, - ■ ■ is 
independent of N^(r]). Define partial sums 



N 



Sn = Zj, = max{Sj;j < N}. 

These are still independent of N^{ri). We write the quantity in ()5.6p as 



<(r?,a) - iV^(r/)7{o}(ak) = ^jv^-(^) < 



(5.7) 



We now use arguments similar to those in the proof of Lemma 3.3 of Csiszar and Talata 
(2006), [4|. In the following, 

/i = /i(-|a;(A„\U,.gA^y,-(£))) 

denotes always conditional probability when conditioning with respect to oj{An \ 
^i^K^ym- Then, 



(5.8) 



Note that on A{n,£,k)n{e^ < Nl;{r]) < e-'+^j, see ([SJ]), since log N^{r]) < log|A„| 



j <log|A„| < + 

Hence by independence of {S'^,N > 1} and NjKr]), the last expression of (jS.Sp can be 
bounded from above as follows. 



j:j<log|A„|<|(i + l) 



i:log|A„|<|(i + l) 



Now, Bernstein's inequality, see Lemma 19.21 yields 



(5.9) 



A[5^>c]<2exp(-^). 



This gives 



< 2exp 6Jj . 



22 



Taking into account that 



^ e-'yydy = -e''^^ + -e''^, 



setting b = 2qminS/e and a = § log |A„| — 1, one can upper bound the sum over j in (|5.9 
obtaining 



> ^J6J{a)N>i{r1){logN^{r))y/^ ,A{n,£,k) 



< 4 I a/^ log |A„| - 1 + 1 exp 2 ^ -^^ _ ^ 



since by the choice of 5 in (|5.3p . — < 1. 

Now, there exists no (not depending on Qmin HOP Oil (5) such. that for all n ^ tiq, this 
last upper bound can be replaced by 



-log|A„| - 1 + 1 I exp I — W-log|A„| - 1 I < exp ( -- — - — -v/log|A„ 



This upper bound also holds for the non-conditioned probability fi. Finally, in order to 
get the result uniformly over all possible configurations rj having lo{r]) < i, we need to 
sum over all possible choices of patterns rj. This gives, by definition of 



1^0° Wl = (2^)'' < 1^1(2^")" = gS'^loglyllVloglA^I 

terms. Thus we can conclude that for all n > uq, taking 5 as in (|5.3|) we have 



d ry G 



> 



(logivMii 



,A{n,£,k) 



< 4^2'' log 1^1 Vi^^ exp ( -^^^./log |A„ 



3 e 



4 exp -c(5)A/log I A, 



where $c(\delta) = \frac{4}{3} \frac{q_{\min} \delta}{e} - 2^d \log|A| > 0$. This concludes the proof.

We are now able to give the proof of Theorem 5.1.

Proof of Theorem [57T] Fix rj £ A^oW with 1 > loiv), let 7(a) = ^{o}{a\v), ^ as in (fO 
and denote by 



A::||A:||<2i?„ 



^^1^ - 7(a)| < y<57(a)Kn^)]-Vl°g^'(^) 
Then on $E_n(\eta)$, using Jensen's inequality, the definition of $\hat p_n$ and $N_n^k(\eta) \le N_n(\eta)$,



Pn{a\rj) -7{0}(«h) 



fc:||A:||<2R„ 



]^-7m(a|.) 



< 



< 



r / I . \/logiV*;(r/) 

^ fe:||fc||<2fl„ "^'^ 

(4i?^ + l)^/^,5V^|o|(a|7?)V^[log jV„(r?)]V^ 
[iVn(r/)]i/2 

<5'^/2 (log|A„|)^5i/27|o}(«l^) 



[logiVn(r?)]^/^ 
[iVn(r/)]i/2 ■ 



(5.10) 



On {log |A„| < I log Nnirj)}, this last expression can be bounded from above by 



■2' y^'i"i> Nn(r,) 
where k{6) is chosen as in (j3.18p . Hence we get 



Nniv) ' 



_7?e^^O°(*):£>Zo(r;) 



(5.11) 



But 



U i^n(r?)^ 



u u 



7(a) 



therefore applying Lemma 15.21 we can finally upper bound 



U Er^ir^r ,B{n,i) 



< iiiRn + l)'*exp -c{6) Jlog\An\ , 



for all n > uq. This finishes the proof. 



The following lemma gives conditions ensuring that $\mu(B(n,\ell)^c)$ converges to $0$ by giving the precise rate of convergence.

Lemma 5.3 For any $0 < \varepsilon_1 < 1$, $0 < \varepsilon_2 < 1$, and for any positive $C_1$ and $C_2$ there exists $n_0 = n_0(q_{\min}, \min(\varepsilon_1, \varepsilon_2), \min(C_1, C_2))$ so that for $n > n_0$ and for any $\ell \le R_n$, we have

$$\mu\big(\exists\, \eta \in A^{V_0^0(\ell)}, \ell \ge \ell_0(\eta) : N_n^k(\eta) \le C_1 |\Lambda_n|^{1-\varepsilon_1}\big) \le \exp\big(-C_2 |\Lambda_n|^{1-\varepsilon_2}\big). \qquad (5.12)$$



Proof. Fix some rj with lo{r]) < £ < Rn- Then {l^(Xj), j G A^} is a cohection of 
conditional independent random variables, conditioned on fixing the configuration ix'(A„ \ 
^j(zXkVj{i)). By Assumption 13.41 we have that 



Here we have used that \Vq{ 



= {2£)'^. Then a conditional version of the Hoeffding 



inequality, see for example Lemma A3 in Csiszar and Talata (2006), [1], yields 



As a consequence, we obtain also for the unconditioned probability. 



Kir]) i(2ir 

_ |A^| 2^™^" 



(21)'' 



and thus 



{21)°- 



(5.13) 



(5.14) 



(5.15) 



To obtain ()5.12p we need to compare |A„| to |A^|. By construction we have for n suffi- 
ciently large. 



I A! I > 



|An| 



> 



|A, 



(4ii„ + l)'^ - {bRn^' 



This and (j5.15p imply that 



But 



1 (2e)'i |A„| 



{5Rn 



_ I ^min 

g (5H„)d 16 



^^'^ {bRnY - ' "'(5i?„)'^- 



(5.16) 



(5.17) 



(5.18) 



By the definition of $R_n$ in (3.15), $R_n^d = \sqrt{\log|\Lambda_n|}$. Thus for any $C > 0$ and for any $\varepsilon > 0$ there exists $n_0 = n_0(q_{\min}, \varepsilon, C)$ so that for $n > n_0$,



|A, 



i^RnY 



2''^\og |A„| log 

\AJ'- . . _ > C. 



5Vlog|An| 



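This and (5.17) imply that for any $\varepsilon_1 > 0$ and $\varepsilon_2 > 0$, positive $C_1$ and $C_2$, for $n > n_0$, we have

$$\mu\big[\exists\, \eta \in A^{V_0^0(\ell)}, \ell \ge \ell_0(\eta) : N_n^k(\eta) \le C_1 |\Lambda_n|^{1-\varepsilon_1}\big] \le |A|^{(2R_n)^d}\, e^{-2 C_2 |\Lambda_n|^{1-\varepsilon_2}}. \qquad (5.19)$$

Finally, note that for $n > n_0$,

$$|A|^{(2R_n)^d} = e^{2^d \log|A| \sqrt{\log|\Lambda_n|}} \le e^{C_2 |\Lambda_n|^{1-\varepsilon_2}}.$$

Thus we have proved the lemma.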
6 Proof of Theorem 3.5

We prove the bound (3.20) on the probability of overestimation. Recall the definition of the set $B(n, R_n)$ given in (5.2). Clearly,

$$\mu(\exists\, i \in \Lambda_n : \hat\ell_n(i) > \ell_i(\sigma)) \le \mu((B(n,R_n))^c) + \mu(\exists\, i \in \Lambda_n : \hat\ell_n(i) > \ell_i(\sigma),\ B(n,R_n)). \qquad (6.1)$$



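The first term is estimated by Lemma 5.3, choosing $\varepsilon_1 = \frac{1}{3}$, $\varepsilon_2 = \varepsilon$, $C_1 = 1$, $C_2 = 2$. This yields

$$\mu((B(n,R_n))^c) \le (4R_n + 1)^d\, e^{-2|\Lambda_n|^{1-\varepsilon}}$$

for all $n > n_0$, where $n_0$ depends on the choices $\varepsilon_1 = \frac{1}{3}$, $\varepsilon_2 = \varepsilon$, $C_1 = 1$, $C_2 = 2$ and on $q_{\min}$. Since

$$(4R_n + 1)^d \le C(d)\sqrt{\log|\Lambda_n|} \le C(d)\, e^{|\Lambda_n|^{1-\varepsilon}}$$

eventually, we have that for all $n > n_0$

$$\mu((B(n,R_n))^c) \le C(d)\, e^{-|\Lambda_n|^{1-\varepsilon}}. \qquad (6.2)$$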



We now study the last term of (6.1). We are interested in the event $\{\hat\ell_n(i) = \ell > \ell_i(\sigma)\}$. Note that $\ell > \ell_i(\sigma)$ implies that for any $j$ such that $X_j^{\ell-1} = \sigma_i^{\ell-1}$, necessarily $\ell_j(\sigma) = \ell_i(\sigma) \le \ell - 1$ and as a consequence $\gamma_j(\cdot|X_j^{\ell-1}) = \gamma_i(\cdot|\sigma_i^{\ell-1})$.

Hence, for any $\ell > \ell_i(\sigma)$, we have, by (3.14)



log Ln{i,i) 



E 



7 A . =cr 



< 



E 



1 



E 



1 



[log MPLnU, e) - log MPLnii, e - 1)] 

logMPL„(j,£) - logPL(:'^-i)(7.(-k^i)) 



^ p.(a|X|)log 



A^„(Xj,a)log 



Pn(a|X^O 
l^{a\at') 



< 



E E 

E E 

1 I 



[pMX^) - 7^ia\at')y 



l^Hat') 



Pn{a\X^) -jj{a\cj{a)) 



lj{a\cj{cr)) 



(6.3) 



We used that for any two probability distributions $P$ and $Q$ on $A$,

$$\sum_{a \in A} P(a) \log\frac{P(a)}{Q(a)} \le \sum_{a \in A} \frac{(P(a) - Q(a))^2}{Q(a)} \qquad (6.4)$$

(see Csiszár and Talata (2006), [4]), and in the last line the fact that $\{X_j^{\ell-1} = \sigma_i^{\ell-1},\ \ell > \ell_i(\sigma)\}$ implies $\gamma_j(a|c_j(\sigma)) = \gamma_j(a|\sigma_j^{\ell-1})$. Hence, writing for short $\gamma_j(\cdot) = \gamma_j(\cdot|c_j(\sigma))$, define

Ee = 

Vj G An,lj{a) < £,ya £ A: p„(a|crj) - 7j(a) 



< 



K((5)7j(a; 



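where $\delta$ is as in (5.3) and $\kappa(\delta)$ is defined in (3.18). Then on $E_\ell$, (6.3) can be bounded uniformly in $i \in \Lambda_n$ from above by

$$\kappa(\delta)\,|A|\,|A|^{|\partial V_0(\ell)|} \log|\Lambda_n|.$$

Hence, on $E_\ell$, for all $i \in \Lambda_n$ we have for all $\ell > \ell_i(\sigma)$,

$$\log L_n(i,\ell) \le \kappa(\delta)\,|A|\,|A|^{|\partial V_0(\ell)|} \log|\Lambda_n| = \mathrm{pen}(\ell, n).$$

This implies that $\hat\ell_n(i) \le \ell_i(\sigma)$ by definition of the estimator. Thus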



Rn 



H{3ie An ■■ in{i) > li{a),B{n,Rn)) < KEe,l3{n,Rn)). 



But 



E^ C 



3aeA,3rje W : Zo(^) < ^, |^„(a|r?) - 7{o}(«h)| > ^k(5)7{o} 



Hence by Theorem 15. H for n > no, we have 



5^;u(^;,^^(n,i^„)) < |^|C((i)i?f ^exp -c((^)^log|A. 



By definition of Rn, 



i?^'<(logAn)^. 



(6.5) 



(6.6) 



Taking into account (6.1), (6.2), (6.5) and (6.6) we get (3.20). This finishes the proof of Theorem 3.5.
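The elementary inequality (6.4), which bounds the relative entropy by the chi-square distance, follows from $\log x \le x - 1$ and can be checked numerically. The sketch below draws random distributions on a finite alphabet (the alphabet size, number of trials and seed are arbitrary illustrative choices) and verifies the inequality with natural logarithms.

```python
import math
import random

def kl_divergence(p, q):
    """Relative entropy sum_a p(a) log(p(a)/q(a)), natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi_square(p, q):
    """Chi-square distance sum_a (p(a) - q(a))^2 / q(a)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def random_distribution(k, rng):
    """Random strictly positive probability vector of length k."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def check_inequality(trials=1000, k=4, seed=0):
    """Return True if KL <= chi-square holds on all sampled pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = random_distribution(k, rng), random_distribution(k, rng)
        if kl_divergence(p, q) > chi_square(p, q) + 1e-12:
            return False
    return True

if __name__ == "__main__":
    print(check_inequality())
```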



7 Proof of Theorem 3.9



We now turn to the problem of underestimation. We suppose that $n_0$ is sufficiently large such that $R_n \ge L$. Fix $i \in \Lambda_n$ and suppose that $\hat\ell_n(i) < \ell_i(\sigma)$. Since $\ell_i(\sigma) \le L \le R_n$, this implies by definition of the estimator that for $\ell = \ell_i(\sigma)$,

$$\log L_n(i,\ell) \le \mathrm{pen}(\ell, n). \qquad (7.1)$$
Recall that by (3.14),



log Lnii,i) 



E 



7 A =G 



log MPLnUJ) 



\ogMPKii,e-i). 



By definition of A„(r/) in ()4.14p we can write 



Set 



|An| 



log Lnii,i 



D{i,£,a) 



\ 



i_ e-i J 



-A„(X|) 



An (a; 



- p{{at\ a)) log p{a\at') 
aeA 

-5;p((a^^a))logp(a|a^l). 

a&A 



Yl YP^^'^i a)) log p{a\al 
ug^avoW aeA 

-Y,Pii^t\^mogp{a\at'). 

aeA 



Moreover, for a constant t > that will be chosen later, define 
Eit,e) = {V 7? G V V G : |A„(r?t;)| < * ^ 



(7.2) 



2 l^ll'SVbWI 



|A„(r?)| < t/2}. 



Then on E(t, > 



log Ln{i,e) > D{i,i,a)-t. 



Next we show that D{i, I, a) can be bounded away from zero. Taking into account 

5] p{{at\v,a))=p{{at\a)), 
veA^^oW 



we can write 



Dii,e,a)= ^ ^p((cT^\^;,a))log 



veA^voit) aeA 
By Pinsker's inequality for relative entropy (see for example Fedotov et al. (2003), [13]), we have that for $P$ and $Q$ probability distributions on $A$,

$$\sum_{a \in A} P(a) \log\frac{P(a)}{Q(a)} \ge \frac{1}{2}\, \|P - Q\|_{TV}^2.$$

Moreover,

$$\|P - Q\|_{TV}^2 \ge \sup_{a} (P(a) - Q(a))^2.$$

But, since $\ell = \ell_i(\sigma)$, there exist $v$ and $a$ such that $p(a|\sigma_i^{\ell-1}) \ne p(a|\sigma_i^{\ell-1}, v)$. Hence we have that $D(i,\ell,\sigma) > 0$. Since we are working under the assumption that $\ell_i(\omega) \le L$ for all $\omega$ (recall condition (3.22) of Assumption 3.7), we can thus conclude that

$$D(i,\ell,\sigma) \ge d_0 > 0, \qquad (7.3)$$

where

$$d_0 = \inf D(i, \ell_i(\sigma), \sigma) > 0.$$
Choosing now t = ^, we finally obtain that on E{^,i), for i = k{a), 

log Ln{i,i) — pen{£,n) > l-^nl"^ — pen{£,n) > 

for n > no{i), since pen{i,n) = log |A„| = 0(log |A„,|). This is in contradic- 

tion to (|7.ip and implies that ln{i) > hi^')- Hence we conclude that 



fi[3i G A„ : < k{a)] < ^ (y'^) 



(7.4) 



We use (j4.16p and sum over all possibilities of choosing rj E A^oi^ i) and of choosing 
patterns X| such that Xj^^ = ry, which gives terms. But since for i < L, 

|^||VbWI < |^|(2i+i)^^ we finally obtain 



<3\A\e'/' f|^|(2^+i)')exp(-C(d,L)- 



\An\doal 



8|^|2[log2 ao]|^|l^^oWle 



< 3|^|eV^ f exp f Cjd, L)\K\doa^ 



8|^|2[log2 ao]|^|l^^°(^)le 
3\A\e'/' (|^|(2^+i)')exp(-C2(d,L,g™„)|A„|) 



(recall the control of $\alpha_0$ given in (4.13)), where $C_2(d, L, q_{\min})$ is another constant depending on the dimension $d$, on the interaction range $L$ and on $q_{\min}$.

Thus, we can conclude that for any $0 < \varepsilon < 1$ there exists $n_0 = n_0(\varepsilon, q_{\min}, L, d)$ such that for all $n > n_0$,



(7.5) 



(7.4) and (7.5) together conclude the proof of Theorem 3.9.
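Pinsker's inequality, used above to bound $D(i,\ell,\sigma)$ away from zero, can also be checked numerically. The sketch below assumes that the norm is the $L^1$ distance $\sum_a |P(a) - Q(a)|$, for which the constant is $\frac{1}{2}$ (with the convention of half the $L^1$ distance the constant would be $2$); alphabet size, number of trials and seed are arbitrary illustrative choices.

```python
import math
import random

def kl_divergence(p, q):
    """Relative entropy sum_a p(a) log(p(a)/q(a)), natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    """L1 distance sum_a |p(a) - q(a)|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def pinsker_gap(p, q):
    """KL(p||q) - 0.5 * ||p - q||_1^2, nonnegative by Pinsker's inequality."""
    return kl_divergence(p, q) - 0.5 * l1_distance(p, q) ** 2

def min_gap(trials=1000, k=4, seed=1):
    """Smallest Pinsker gap over randomly drawn pairs of positive distributions."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(trials):
        wp = [rng.random() + 1e-3 for _ in range(k)]
        wq = [rng.random() + 1e-3 for _ in range(k)]
        p = [x / sum(wp) for x in wp]
        q = [x / sum(wq) for x in wq]
        best = min(best, pinsker_gap(p, q))
    return best

if __name__ == "__main__":
    print(min_gap())
```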
8 Final comments

We generalize the concept of chains with memory of variable length to the multidimensional case of random fields. The main aim of this concept is to adopt a parsimonious way of describing data: the symbol at site $i$ is influenced only by a random set of symbols, and this set depends on the observed data. As in the case of one dimensional models, the set of relevant neighbor states of site $i$ is called the context. The radius of the smallest ball containing the support of the context is the length of the context of site $i$.

We presented in Section 3 an estimator of the context length function based on a sequence of local decisions between two possible context lengths. These decisions are performed using the log likelihood ratio function. In the case of dimension one, our estimator is simply the context length estimator of variable length chains which has been classically considered in the literature. We refer the interested reader to Galves and Löcherbach (2008) for a survey and bibliographic comments.



9 Appendix

At the end of Section 2 we argued that in order to estimate the context of a finite set of sites $\Lambda \Subset \mathbb{Z}^d$ it is sufficient to estimate the contexts of the one-point specification. In particular, we stated formula (2.17), which relates $c_\Lambda(\omega)$ to $c_i(\omega)$. In the first subsection of this appendix we show this. In the second subsection, we complete the computations for Example 2.10. Finally we state a deviation inequality needed in Section 5.

9.1 From one point specifications to several points

It is well known in Statistical Mechanics that the positive one point specification uniquely determines the family of specifications, see Theorem 1.33 of Georgii (1988), [20]. This result still holds for Variable-neighborhood random fields, since they can be embedded into classical random fields. But we would like to determine if and how the context of one single site determines the $\Lambda$-contexts of the specification, for any $\Lambda \Subset \mathbb{Z}^d$. Proposition 9.1 gives an answer.

We consider local specifications $\gamma$ which are positive, i.e.

$$\gamma_\Lambda(\omega_\Lambda|\cdot) > 0, \quad \text{for all } \omega_\Lambda \in A^\Lambda \text{ and } \Lambda \Subset \mathbb{Z}^d.$$

In the following it will be convenient to write

$$\gamma_\Lambda(\{\omega_\Lambda\}|\sigma) = \rho_\Lambda(\omega_\Lambda \sigma_{\Lambda^c}).$$

This family $\{\rho_\Lambda, \Lambda \Subset \mathbb{Z}^d\}$ is a family of functions $\rho_\Lambda : \Omega \to [0,1]$ satisfying the following two conditions:



(9.1) 



and for every A C A d Z,'^, all lo, r],a in Q we have 



PA(wAO-A\Af''A'=) 



PA{uJA<yA'^) 

Pa(??ao-aO ' 



(9.2) 



PA(??ACrA\AO"A=) 
Proposition 9.1 Assume that the family of local specifications $\gamma$ defined in (2.6) is positive^1. We have the following:

• $\gamma$ is uniquely determined by $\{\gamma_{\{i\}}(\cdot|c_i(\omega)),\ i \in \mathbb{Z}^d,\ \omega \in \Omega\}$.

• For $\Lambda \Subset \mathbb{Z}^d$,

$$\mathrm{sp}_\Lambda(\omega) = \Big(\bigcup_{i \in \Lambda} \mathrm{sp}_{\{i\}}(\omega)\Big) \setminus \Lambda. \qquad (9.3)$$

Proof. Recall that we set 7|j}({a;(i)}|cj(a;)) = /9|j|(cj). Further /9|j}(a;) > for w G 
since we assumed that 7 is positive. For each fixed a;(i), Wjjjc 1— /9|j}(a;) is a measurable 
function with respect to -T^sp^;}, see (|2.4p . For each A, Georgii (1988), [20]; shows in the 
proof of Theorem 1.33 how to determine pt^ in terms of {/Oji},"* G Z"^} such that for any 
measurable function / we have that 

y f{uj)dp{uj) = J dp{ujM) J2 fi^)pAi^)^ VA d (9.4) 

where p is any measure on so that 

/ f{^)dl^{^) = i dp{uj{iyc) ^ f{uj)pi{uj). 

•J J . . . . ^ A 



This immediately shows that $\gamma$ is uniquely determined by $\rho_{\{i\}}$. To construct $\rho_\Lambda$ and to prove (2.17), one proceeds by induction on $|\Lambda|$. The case $|\Lambda| = 1$ is trivial. Suppose then that $\rho_{\Lambda_1}$ and $\rho_{\Lambda_2}$ have been constructed. Let $\Lambda$ be the union of two disjoint sets, $\Lambda = \Lambda_1 \cup \Lambda_2$, $\Lambda_1 \cap \Lambda_2 = \emptyset$. Define

M(-) = r. (9.5) 

By induction, for any given uj\^ , uja^ ^ PAi i^) = PAi {^Ai 1 f^A^) depends only on spA^ (uj), 
and for any given cjaj, ^a^ ^ /OA2 ('^A2 1 f^A^ ) only on spA2('^)- Hence (|9.5p implies that 
for any given uja, the function loa i— )■ pa{^) depends, by construction, on the cr— algebra 
generated by spAi(w) U IJ^^^ spA2('^Ai<^A'j)- Note that in general spAi(w) H spA2(w) / 0. 
Therefore the value of wai might be relevant for determining spA2('^) and the value of 
a;A2 might be relevant for determining spAi(<^). To have a function pa{^) measurable for 
any choice of oja we set 

spA(a;) = U^Ai Ua;A2 ('^PAi(^) U spA2(w)) \ (Ai U A2). 

In this way, for any choice of wa, Pa(w) is /gpA— measurable. It is immediate to verify by 
induction that one has 

spA(t^) = [Uo^A Ui6A sp{i}(u;)] \ A. 
We need to show that ()9.4p holds. By induction, taking in account that oj = (was.,wa^); 

/ f{uj)dp{uj) = j dp{uJAi)Y,f{uj)pAM^ k = l,2 (9.6) 



^1 The positivity requirement can be relaxed, under some minor modifications of the proof, see Georgii (1988), [20], Theorem 1.33.

holds. To show that this holds for $\rho_\Lambda$ take a positive measurable function $f$ defined on $\Omega$. We have



PA2(^A2^Ai)PA2 (^A2'^Ai) J2 fi^Ai^A2^A'= 



But applying (|9.6|) first to A2, then to Ai, this last line can be written as 
j dfJ-iQ) ^ /)A2('^A2'^A§)Pa2 ('^AaWA^) ^ /(WA1WA2WAO 



^/(cJAi^A-) 



X]/('^AiWA-) 



C^/^l'^) J2 /('^AiWAf)pAi (a;Ai^^A^)PAi ('^Ai^^A^) 

t<JAi 

= / C?/^(w)^/(wAiWA=)/>Ai(wAi^^A5)pXHwAit^A5) 

Applying once more ()9.6p . we obtain 

(i^(wA^)^/(a;)pAi(^^)PA^(^) = / f^/^('^)/(^)PA^(^)- 



l^A, 



By applying the above equality to f{uj)p\{uj) instead of f{uj) we get the result. The above 
definition of p\ depends on the choice of Ai and A2; one needs to obtain an unambiguous 
definition of pA to choose a definite strategy to exhausting A site by site. • 



9.2 Continuation of Example 2.10

We prove formula (j2.14p . using (|2.13p . First note that i G implies that \\i — j\\ < L. 

Now if i e rj(a;), we have two cases. Either i € Tj{uj), in which case rj{uj) = T^{uj). Or 
i G 9rj(w). Then uj{i)i = 1, and in this case, i G Fj^uj'^). Then the same arguments as 
above show that 

Hence 
Finally, by definition of $\Gamma_j(\omega)$



U {r,(^) : i e T,{u;)} U |J {r,(^^) : i G r,(a;^)} 



3 




U{r]M:zGr,H}u U{r](a;^):iGr,(c.^)} n J VjiL) 




= r}{uj)nVi{2L). 



This concludes the proof. 



We close with the following version of Bernstein's inequality obtained by Freedman (1975), for discrete-time martingales having bounded jumps, see for instance Dzhaparidze and van Zanten [11].

Lemma 9.2 Let $M_n = \xi_1 + \cdots + \xi_n$ be a discrete martingale with respect to some filtration $(\mathcal{F}_n)_{n \ge 0}$ having bounded jumps $|\xi_i| \le a$. Let
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